# flyte-deployment
k
Any documentation we can take a look at to diagnose the following error the propeller is logging on restart:
"msg":"failed to load plugin - spark: no matches for kind \"SparkApplication\" in version \"<http://sparkoperator.k8s.io/v1beta2|sparkoperator.k8s.io/v1beta2>\"
My propeller config map looks like this:
```yaml
# Source: flyte-core/templates/propeller/configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: flyte-propeller-config
  namespace: flyte
  labels: 
    app.kubernetes.io/name: flyteadmin
data:
  admin.yaml: | 
    admin:
      clientId: 'flytepropeller'
      clientSecretLocation: /etc/secrets/client_secret
      endpoint: flyteadmin:81
      insecure: true
    event:
      capacity: 1000
      rate: 500
      type: admin
  catalog.yaml: | 
    catalog-cache:
      endpoint: datacatalog:89
      insecure: true
      type: datacatalog
  copilot.yaml: | 
    plugins:
      k8s:
        co-pilot:
          image: cr.flyte.org/flyteorg/flytecopilot:v0.0.24
          name: flyte-copilot-
          start-timeout: 30s
  core.yaml: | 
    manager:
      pod-application: flytepropeller
      pod-template-container-name: flytepropeller
      pod-template-name: flytepropeller-template
    propeller:
      downstream-eval-duration: 30s
      enable-admin-launcher: true
      gc-interval: 12h
      kube-client-config:
        burst: 25
        qps: 100
        timeout: 30s
      leader-election:
        enabled: true
        lease-duration: 15s
        lock-config-map:
          name: propeller-leader
          namespace: flyte
        renew-deadline: 10s
        retry-period: 2s
      limit-namespace: all
      max-workflow-retries: 50
      metadata-prefix: metadata/propeller
      metrics-prefix: flyte
      prof-port: 10254
      queue:
        batch-size: -1
        batching-interval: 2s
        queue:
          base-delay: 5s
          capacity: 1000
          max-delay: 120s
          rate: 100
          type: maxof
        sub-queue:
          capacity: 1000
          rate: 100
          type: bucket
        type: batch
      rawoutput-prefix: s3://${ parameters.s3_bucket_name }/
      workers: 40
      workflow-reeval-duration: 30s
    webhook:
      certDir: /etc/webhook/certs
      serviceName: flyte-pod-webhook
  enabled_plugins.yaml: | 
    tasks:
      task-plugins:
        default-for-task-types:
          container: container
          container_array: k8s-array
          sidecar: sidecar
          spark: spark
        enabled-plugins:
        - container
        - sidecar
        - k8s-array
        - spark
  k8s.yaml: | 
    plugins:
      k8s:
        default-cpus: 100m
        default-env-vars: []
        default-memory: 100Mi
  resource_manager.yaml: | 
    propeller:
      resourcemanager:
        type: noop
  storage.yaml: | 
    storage:
      type: s3
      container: "${ parameters.s3_bucket_name }"
      connection:
        auth-type: iam
        region: ${ parameters.aws_region }
      limits:
        maxDownloadMBs: 10
  cache.yaml: |
    cache:
      max_size_mbs: 1024
      target_gc_percent: 70
  task_logs.yaml: | 
    plugins:
      logs:
        cloudwatch-enabled: false
        kubernetes-enabled: false
  spark.yaml: |
    plugins:
      spark:
        spark-config-default:
          - spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version: "2"
          - spark.kubernetes.allocation.batch.size: "50"
          - spark.hadoop.fs.s3a.acl.default: "BucketOwnerFullControl"
          - spark.hadoop.fs.s3n.impl: "org.apache.hadoop.fs.s3a.S3AFileSystem"
          - spark.hadoop.fs.AbstractFileSystem.s3n.impl: "org.apache.hadoop.fs.s3a.S3A"
          - spark.hadoop.fs.s3.impl: "org.apache.hadoop.fs.s3a.S3AFileSystem"
          - spark.hadoop.fs.AbstractFileSystem.s3.impl: "org.apache.hadoop.fs.s3a.S3A"
          - spark.hadoop.fs.s3a.impl: "org.apache.hadoop.fs.s3a.S3AFileSystem"
          - spark.hadoop.fs.AbstractFileSystem.s3a.impl: "org.apache.hadoop.fs.s3a.S3A"
          - spark.hadoop.fs.s3a.multipart.threshold: "536870912"
          - spark.blacklist.enabled: "true"
          - spark.blacklist.timeout: "5m"
          - spark.task.maxfailures: "8"
```
And my cluster resource template:
```yaml
# Source: flyte-core/templates/clusterresourcesync/cluster_resource_configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: clusterresource-template
  namespace: flyte
  labels: 
    app.kubernetes.io/name: flyteadmin
    helm.sh/chart: flyte-core-v0.1.10
data:
  aa_namespace.yaml: | 
    apiVersion: v1
    kind: Namespace
    metadata:
      name: {{ namespace }}
    spec:
      finalizers:
      - kubernetes
    
  aab_default_service_account.yaml: | 
    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: default
      namespace: {{ namespace }}
    
  ab_project_resource_quota.yaml: | 
    apiVersion: v1
    kind: ResourceQuota
    metadata:
      name: project-quota
      namespace: {{ namespace }}
    spec:
      hard:
        limits.cpu: {{ projectQuotaCpu }}
        limits.memory: {{ projectQuotaMemory }}
    
  ac_spark_role.yaml: | 
    apiVersion: rbac.authorization.k8s.io/v1
    kind: Role
    metadata:
      name: spark-role
      namespace: {{ namespace }}
    rules:
    - apiGroups: ["*"]
      resources:
      - pods
      verbs:
      - '*'
    - apiGroups: ["*"]
      resources:
      - services
      verbs:
      - '*'
    - apiGroups: ["*"]
      resources:
      - configmaps
      verbs:
      - '*'
    
  ad_spark_service_account.yaml: | 
    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: spark
      namespace: {{ namespace }}
    
  ae_spark_role_binding.yaml: | 
    apiVersion: rbac.authorization.k8s.io/v1
    kind: RoleBinding
    metadata:
      name: spark-role-binding
      namespace: {{ namespace }}
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: Role
      name: spark-role
    subjects:
    - kind: ServiceAccount
      name: spark
      namespace: {{ namespace }}
```
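Side note on the template: once the cluster resource sync runs, each Flyte project-domain namespace should end up with the `spark` ServiceAccount, Role, and RoleBinding defined above. A quick spot check, using `flytesnacks-development` purely as an example namespace name:
```
# Substitute one of your real project-domain namespaces for the example below
kubectl get serviceaccount spark -n flytesnacks-development
kubectl get role spark-role -n flytesnacks-development
kubectl get rolebinding spark-role-binding -n flytesnacks-development
```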
d
Hey @Katrina P, do you have the Spark k8s operator deployed on the cluster? It looks like the Spark plugin does not detect the CRD.
k
Yes, it's been deployed on the cluster since before Flyte was installed.
I'll double-check the namespace it's in.
Hmm, yeah, it seems to be there under the `spark-operator` namespace.
I got the service to start up, but it still complains:
"No plugin found for Handler-type [spark], defaulting to [container]"
Also
"No plugin found for Handler-type [python-task], defaulting to [container]"
However, based on some other Slack messages I searched, this seems to be the normal behavior?
d
So for `python-task` this is normal. Basically, under the plugins configuration there is a mapping of task types to plugin IDs. Typically these are the same, so it seems a little redundant. Example:
```yaml
enabled_plugins.yaml: | 
    tasks:
      task-plugins:
        default-for-task-types:
          container: container
          container_array: k8s-array
          sidecar: sidecar
          spark: spark
        enabled-plugins:
        - container
        - sidecar
        - k8s-array
        - spark
```
In propeller, if we register a task type that doesn't have an associated plugin, it falls back to the `container` plugin. As I mentioned, this is normal for `python-task`, but for `spark` it could be an issue. Did you change this configuration?
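For context, this is roughly what a task with the `spark` task type looks like from the flytekit side. The sketch below is illustrative and not from this thread; the function names and Spark settings are made up. If propeller has no `spark` plugin registered, a task like this gets handled by the fallback `container` plugin instead of being submitted to the Spark operator.
```python
# Minimal sketch of a flytekit task whose task type is "spark".
import flytekit
from flytekit import task, workflow
from flytekitplugins.spark import Spark  # requires the flytekitplugins-spark package


@task(
    task_config=Spark(
        # Illustrative settings only; real values depend on your cluster.
        spark_conf={
            "spark.executor.instances": "2",
            "spark.executor.memory": "1g",
        }
    )
)
def count_rows() -> int:
    # The Spark plugin injects a SparkSession into the task context.
    sess = flytekit.current_context().spark_session
    return sess.sparkContext.parallelize(range(100)).count()


@workflow
def wf() -> int:
    return count_rows()
```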
cc @Ketan (kumare3) - help in setting up Spark plugin? Does the Spark k8s operator Namespace matter?
k
Config seems right
k
I guess this means that my Spark jobs are running via the container plugin rather than the Spark operator. My config is the same as what you posted; see my propeller ConfigMap above.
k
@Katrina P not at my desk, will talk in a bit. @Samhita Alla / @Yuvraj / if you folks are around
y
@Katrina P Can you please post the output of these commands?
```
kubectl api-versions | grep 'sparkoperator.k8s.io'
kubectl api-resources | grep 'sparkoperator.k8s.io'
```
k
Sure, I don't have kubectl access on our cluster, so I'll have to find an on-call engineer, but I will report back.
👍 1
```
$ kubectl --kubeconfig=ss-dev-new1 api-versions | grep 'sparkoperator.k8s.io'
sparkoperator.k8s.io/v1beta2
```
```
$ kubectl --kubeconfig=ss-dev-new1 api-resources | grep 'sparkoperator.k8s.io'
scheduledsparkapplications        scheduledsparkapp                    sparkoperator.k8s.io/v1beta2           true         ScheduledSparkApplication
sparkapplications                 sparkapp                             sparkoperator.k8s.io/v1beta2           true         SparkApplication
```
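A related check, given the earlier suspicion that Spark tasks are falling back to plain containers: if tasks were actually going through the operator, SparkApplication objects would show up in the project-domain namespaces.
```
# Lists any SparkApplication objects created by Flyte (or anything else)
kubectl get sparkapplications --all-namespaces
```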
y
Can you also send me the command used for the Spark operator installation? Also, can you restart the propeller deployment and send me the startup logs? Did you follow the plugin docs? https://docs.flyte.org/en/latest/deployment/plugin_setup/k8s/index.html#deployment-plugin-setup-k8s
k
The Spark operator was already installed in our cluster by DevOps before we installed Flyte; I can ask how they installed it. We did follow the k8s plugin instructions to update the Helm chart, and we then used the chart to generate the updated propeller ConfigMap (above), admin cluster role, and cluster resource template (above) manifests:
```
helm template flyte flyteorg/flyte-core -f https://raw.githubusercontent.com/flyteorg/flyte/master/charts/flyte-core/values-sandbox.yaml -f values-override.yaml -n flyte > spark-override.yaml
```
logs:
```
time="2022-08-05T19:03:52Z" level=info msg=------------------------------------------------------------------------
time="2022-08-05T19:03:52Z" level=info msg="App [flytepropeller], Version [unknown], BuildSHA [unknown], BuildTS [2022-08-05 19:03:52.575011998 +0000 UTC m=+0.049406538]"
time="2022-08-05T19:03:52Z" level=info msg=------------------------------------------------------------------------
time="2022-08-05T19:03:52Z" level=info msg="Detected: 8 CPU's\n"
{"json":{},"level":"error","msg":"failed to initialize token source provider. Err: failed to fetch auth metadata. Error: rpc error: code = Unimplemented desc = unknown service flyteidl.service.AuthMetadataService","ts":"2022-08-05T19:03:53Z"}
{"json":{},"level":"warning","msg":"Starting an unauthenticated client because: can't create authenticated channel without a TokenSourceProvider","ts":"2022-08-05T19:03:53Z"}
{"json":{},"level":"error","msg":"failed to initialize token source provider. Err: failed to fetch auth metadata. Error: rpc error: code = Unimplemented desc = unknown service flyteidl.service.AuthMetadataService","ts":"2022-08-05T19:03:53Z"}
{"json":{},"level":"warning","msg":"Starting an unauthenticated client because: can't create authenticated channel without a TokenSourceProvider","ts":"2022-08-05T19:03:53Z"}
{"json":{},"level":"warning","msg":"defaulting max ttl for workflows to 23 hours, since configured duration is larger than 23 [23]","ts":"2022-08-05T19:03:53Z"}
{"json":{},"level":"warning","msg":"stow configuration section missing, defaulting to legacy s3/minio connection config","ts":"2022-08-05T19:03:53Z"}
I0805 19:03:53.592186       1 leaderelection.go:243] attempting to acquire leader lease flyte/propeller-leader...
I0805 19:04:10.415449       1 leaderelection.go:253] successfully acquired lease flyte/propeller-leader
```
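A quick way to see whether the Spark plugin loaded on a given restart is to filter the propeller logs for the spark-related lines quoted earlier in this thread; the deployment name below assumes the flyte-core chart defaults.
```
# Surfaces either the "failed to load plugin - spark" error or the
# "No plugin found for Handler-type [spark]" fallback warning
kubectl logs deploy/flytepropeller -n flyte | grep -i spark
```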
k
@Katrina P did you get it working?
if you are around I can help
@Yuvraj I do not think it is a Spark operator problem. For some reason the problem seems to be that the plugin is not getting registered in the system
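If the plugin is not getting registered, one low-level sanity check is to confirm that enabled_plugins.yaml is among the config files propeller actually loads. The deployment name and config mount path below assume the flyte-core chart defaults, so adjust them to your install.
```
# Show the args propeller was started with (look for the --config flag)
kubectl -n flyte get deploy flytepropeller \
  -o jsonpath='{.spec.template.spec.containers[0].args}'

# Confirm enabled_plugins.yaml is mounted inside the running pod
kubectl -n flyte exec deploy/flytepropeller -- ls /etc/flyte/config
```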