# flyte-deployment
k
Any documentation we can take a look at to diagnose the following error the propeller is logging on restart:
"msg":"failed to load plugin - spark: no matches for kind \"SparkApplication\" in version \"<http://sparkoperator.k8s.io/v1beta2|sparkoperator.k8s.io/v1beta2>\"
My propeller config map looks like this:
```yaml
# Source: flyte-core/templates/propeller/configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: flyte-propeller-config
  namespace: flyte
  labels: 
    app.kubernetes.io/name: flyteadmin
data:
  admin.yaml: | 
    admin:
      clientId: 'flytepropeller'
      clientSecretLocation: /etc/secrets/client_secret
      endpoint: flyteadmin:81
      insecure: true
    event:
      capacity: 1000
      rate: 500
      type: admin
  catalog.yaml: | 
    catalog-cache:
      endpoint: datacatalog:89
      insecure: true
      type: datacatalog
  copilot.yaml: | 
    plugins:
      k8s:
        co-pilot:
          image: cr.flyte.org/flyteorg/flytecopilot:v0.0.24
          name: flyte-copilot-
          start-timeout: 30s
  core.yaml: | 
    manager:
      pod-application: flytepropeller
      pod-template-container-name: flytepropeller
      pod-template-name: flytepropeller-template
    propeller:
      downstream-eval-duration: 30s
      enable-admin-launcher: true
      gc-interval: 12h
      kube-client-config:
        burst: 25
        qps: 100
        timeout: 30s
      leader-election:
        enabled: true
        lease-duration: 15s
        lock-config-map:
          name: propeller-leader
          namespace: flyte
        renew-deadline: 10s
        retry-period: 2s
      limit-namespace: all
      max-workflow-retries: 50
      metadata-prefix: metadata/propeller
      metrics-prefix: flyte
      prof-port: 10254
      queue:
        batch-size: -1
        batching-interval: 2s
        queue:
          base-delay: 5s
          capacity: 1000
          max-delay: 120s
          rate: 100
          type: maxof
        sub-queue:
          capacity: 1000
          rate: 100
          type: bucket
        type: batch
      rawoutput-prefix: s3://${ parameters.s3_bucket_name }/
      workers: 40
      workflow-reeval-duration: 30s
    webhook:
      certDir: /etc/webhook/certs
      serviceName: flyte-pod-webhook
  enabled_plugins.yaml: | 
    tasks:
      task-plugins:
        default-for-task-types:
          container: container
          container_array: k8s-array
          sidecar: sidecar
          spark: spark
        enabled-plugins:
        - container
        - sidecar
        - k8s-array
        - spark
  k8s.yaml: | 
    plugins:
      k8s:
        default-cpus: 100m
        default-env-vars: []
        default-memory: 100Mi
  resource_manager.yaml: | 
    propeller:
      resourcemanager:
        type: noop
  storage.yaml: | 
    storage:
      type: s3
      container: "${ parameters.s3_bucket_name }"
      connection:
        auth-type: iam
        region: ${ parameters.aws_region }
      limits:
        maxDownloadMBs: 10
  cache.yaml: |
    cache:
      max_size_mbs: 1024
      target_gc_percent: 70
  task_logs.yaml: | 
    plugins:
      logs:
        cloudwatch-enabled: false
        kubernetes-enabled: false
  spark.yaml: |
    plugins:
      spark:
        spark-config-default:
          - spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version: "2"
          - spark.kubernetes.allocation.batch.size: "50"
          - spark.hadoop.fs.s3a.acl.default: "BucketOwnerFullControl"
          - spark.hadoop.fs.s3n.impl: "org.apache.hadoop.fs.s3a.S3AFileSystem"
          - spark.hadoop.fs.AbstractFileSystem.s3n.impl: "org.apache.hadoop.fs.s3a.S3A"
          - spark.hadoop.fs.s3.impl: "org.apache.hadoop.fs.s3a.S3AFileSystem"
          - spark.hadoop.fs.AbstractFileSystem.s3.impl: "org.apache.hadoop.fs.s3a.S3A"
          - spark.hadoop.fs.s3a.impl: "org.apache.hadoop.fs.s3a.S3AFileSystem"
          - spark.hadoop.fs.AbstractFileSystem.s3a.impl: "org.apache.hadoop.fs.s3a.S3A"
          - spark.hadoop.fs.s3a.multipart.threshold: "536870912"
          - spark.blacklist.enabled: "true"
          - spark.blacklist.timeout: "5m"
          - spark.task.maxfailures: "8"
```
And my cluster resource template:
```yaml
# Source: flyte-core/templates/clusterresourcesync/cluster_resource_configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: clusterresource-template
  namespace: flyte
  labels: 
    app.kubernetes.io/name: flyteadmin
    helm.sh/chart: flyte-core-v0.1.10
data:
  aa_namespace.yaml: | 
    apiVersion: v1
    kind: Namespace
    metadata:
      name: {{ namespace }}
    spec:
      finalizers:
      - kubernetes
    
  aab_default_service_account.yaml: | 
    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: default
      namespace: {{ namespace }}
    
  ab_project_resource_quota.yaml: | 
    apiVersion: v1
    kind: ResourceQuota
    metadata:
      name: project-quota
      namespace: {{ namespace }}
    spec:
      hard:
        limits.cpu: {{ projectQuotaCpu }}
        limits.memory: {{ projectQuotaMemory }}
    
  ac_spark_role.yaml: | 
    apiVersion: rbac.authorization.k8s.io/v1
    kind: Role
    metadata:
      name: spark-role
      namespace: {{ namespace }}
    rules:
    - apiGroups: ["*"]
      resources:
      - pods
      verbs:
      - '*'
    - apiGroups: ["*"]
      resources:
      - services
      verbs:
      - '*'
    - apiGroups: ["*"]
      resources:
      - configmaps
      verbs:
      - '*'
    
  ad_spark_service_account.yaml: | 
    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: spark
      namespace: {{ namespace }}
    
  ae_spark_role_binding.yaml: | 
    apiVersion: rbac.authorization.k8s.io/v1
    kind: RoleBinding
    metadata:
      name: spark-role-binding
      namespace: {{ namespace }}
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: Role
      name: spark-role
    subjects:
    - kind: ServiceAccount
      name: spark
      namespace: {{ namespace }}
```
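Side note on the template: once the cluster resource sync runs, each Flyte project-domain namespace should end up with the `spark` ServiceAccount, Role, and RoleBinding defined above. A quick spot check, using `flytesnacks-development` purely as an example namespace name:
```
# Substitute one of your real project-domain namespaces for the example below
kubectl get serviceaccount spark -n flytesnacks-development
kubectl get role spark-role -n flytesnacks-development
kubectl get rolebinding spark-role-binding -n flytesnacks-development
```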
d
Hey @Katrina P, do you have the Spark k8s operator deployed on the cluster? It looks like the Spark plugin does not detect the CRD.
k
Yes, it's been deployed on the cluster since before Flyte was installed.
I'll double-check the namespace it's in.
Hmm, yeah, it seems to be there under the `spark-operator` namespace.
I got the service to start up, but it still complains:
"No plugin found for Handler-type [spark], defaulting to [container]"
Also
"No plugin found for Handler-type [python-task], defaulting to [container]"
However, based on some other Slack messages I searched, this seems to be the normal behavior?
d
So for `python-task` this is normal. Basically, under the plugins configuration there is a mapping of task types to plugin IDs. Typically these are the same, so it seems a little redundant. Example:
```yaml
enabled_plugins.yaml: | 
    tasks:
      task-plugins:
        default-for-task-types:
          container: container
          container_array: k8s-array
          sidecar: sidecar
          spark: spark
        enabled-plugins:
        - container
        - sidecar
        - k8s-array
        - spark
```
In propeller, if we register a task type that doesn't have an associated plugin, it falls back to the `container` plugin. As I mentioned, this is normal for `python-task`, but for `spark` it could be an issue. Did you change this configuration?
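For context, this is roughly what a task with the `spark` task type looks like from the flytekit side. The sketch below is illustrative and not from this thread; the function names and Spark settings are made up. If propeller has no `spark` plugin registered, a task like this gets handled by the fallback `container` plugin instead of being submitted to the Spark operator.
```python
# Minimal sketch of a flytekit task whose task type is "spark".
import flytekit
from flytekit import task, workflow
from flytekitplugins.spark import Spark  # requires the flytekitplugins-spark package


@task(
    task_config=Spark(
        # Illustrative settings only; real values depend on your cluster.
        spark_conf={
            "spark.executor.instances": "2",
            "spark.executor.memory": "1g",
        }
    )
)
def count_rows() -> int:
    # The Spark plugin injects a SparkSession into the task context.
    sess = flytekit.current_context().spark_session
    return sess.sparkContext.parallelize(range(100)).count()


@workflow
def wf() -> int:
    return count_rows()
```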
cc @Ketan (kumare3) - help in setting up Spark plugin? Does the Spark k8s operator Namespace matter?
k
Config seems right
k
I guess this means that my Spark jobs are running via the container plugin rather than the Spark operator. My config is the same as what you posted; see my propeller ConfigMap above.
k
@Katrina P not at my desk, will talk in a bit. @Samhita Alla / @Yuvraj / if you folks are around
y
@Katrina P Can you please post the output of these commands?
```
kubectl api-versions | grep 'sparkoperator.k8s.io'
kubectl api-resources | grep 'sparkoperator.k8s.io'
```
k
Sure, I don't have kubectl access on our cluster, so I'll have to find an on-call engineer, but I will report back.
👍 1
```
$ kubectl --kubeconfig=ss-dev-new1 api-versions | grep 'sparkoperator.k8s.io'
sparkoperator.k8s.io/v1beta2
```
```
$ kubectl --kubeconfig=ss-dev-new1 api-resources | grep 'sparkoperator.k8s.io'
scheduledsparkapplications        scheduledsparkapp                    sparkoperator.k8s.io/v1beta2           true         ScheduledSparkApplication
sparkapplications                 sparkapp                             sparkoperator.k8s.io/v1beta2           true         SparkApplication
```
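A related check, given the earlier suspicion that Spark tasks are falling back to plain containers: if tasks were actually going through the operator, SparkApplication objects would show up in the project-domain namespaces.
```
# Lists any SparkApplication objects created by Flyte (or anything else)
kubectl get sparkapplications --all-namespaces
```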
y
Can you also send me the command used for the Spark operator installation? Also, can you restart the propeller deployment and send me the startup logs? Did you follow the plugin docs? https://docs.flyte.org/en/latest/deployment/plugin_setup/k8s/index.html#deployment-plugin-setup-k8s
k
The Spark operator was already installed in our cluster by DevOps before we installed Flyte; I can ask how they installed it. We did follow the k8s plugin instructions to update the Helm chart, and we then used the chart to generate the updated propeller ConfigMap (above), admin cluster role, and cluster resource template (above) manifests:
```
helm template flyte flyteorg/flyte-core -f https://raw.githubusercontent.com/flyteorg/flyte/master/charts/flyte-core/values-sandbox.yaml -f values-override.yaml -n flyte > spark-override.yaml
```
logs:
```
time="2022-08-05T19:03:52Z" level=info msg=------------------------------------------------------------------------
time="2022-08-05T19:03:52Z" level=info msg="App [flytepropeller], Version [unknown], BuildSHA [unknown], BuildTS [2022-08-05 19:03:52.575011998 +0000 UTC m=+0.049406538]"
time="2022-08-05T19:03:52Z" level=info msg=------------------------------------------------------------------------
time="2022-08-05T19:03:52Z" level=info msg="Detected: 8 CPU's\n"
{"json":{},"level":"error","msg":"failed to initialize token source provider. Err: failed to fetch auth metadata. Error: rpc error: code = Unimplemented desc = unknown service flyteidl.service.AuthMetadataService","ts":"2022-08-05T19:03:53Z"}
{"json":{},"level":"warning","msg":"Starting an unauthenticated client because: can't create authenticated channel without a TokenSourceProvider","ts":"2022-08-05T19:03:53Z"}
{"json":{},"level":"error","msg":"failed to initialize token source provider. Err: failed to fetch auth metadata. Error: rpc error: code = Unimplemented desc = unknown service flyteidl.service.AuthMetadataService","ts":"2022-08-05T19:03:53Z"}
{"json":{},"level":"warning","msg":"Starting an unauthenticated client because: can't create authenticated channel without a TokenSourceProvider","ts":"2022-08-05T19:03:53Z"}
{"json":{},"level":"warning","msg":"defaulting max ttl for workflows to 23 hours, since configured duration is larger than 23 [23]","ts":"2022-08-05T19:03:53Z"}
{"json":{},"level":"warning","msg":"stow configuration section missing, defaulting to legacy s3/minio connection config","ts":"2022-08-05T19:03:53Z"}
I0805 19:03:53.592186       1 leaderelection.go:243] attempting to acquire leader lease flyte/propeller-leader...
I0805 19:04:10.415449       1 leaderelection.go:253] successfully acquired lease flyte/propeller-leader
```
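A quick way to see whether the Spark plugin loaded on a given restart is to filter the propeller logs for the spark-related lines quoted earlier in this thread; the deployment name below assumes the flyte-core chart defaults.
```
# Surfaces either the "failed to load plugin - spark" error or the
# "No plugin found for Handler-type [spark]" fallback warning
kubectl logs deploy/flytepropeller -n flyte | grep -i spark
```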
k
@Katrina P did you get it working?
if you are around I can help
@Yuvraj I do not think it is a Spark operator problem. For some reason the problem seems to be that the plugin is not getting registered in the system
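If the plugin is not getting registered, one low-level sanity check is to confirm that enabled_plugins.yaml is among the config files propeller actually loads. The deployment name and config mount path below assume the flyte-core chart defaults, so adjust them to your install.
```
# Show the args propeller was started with (look for the --config flag)
kubectl -n flyte get deploy flytepropeller \
  -o jsonpath='{.spec.template.spec.containers[0].args}'

# Confirm enabled_plugins.yaml is mounted inside the running pod
kubectl -n flyte exec deploy/flytepropeller -- ls /etc/flyte/config
```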