# flyte-support
acoustic-nest-94594
Hey all, I'm trying to propagate some `tolerations` to the driver/executor pods that get launched via the Flyte Spark plugin, and I must be missing something about how this works; the relevant section of my configuration is in the 🧵, and I think I'm reading the relevant bits of the Spark plugin correctly, but for whatever reason my tolerations aren't making the leap from the configuration to the pods. Any help from folks who have figured this out before would be very much appreciated! 🙇
The relevant section of the config for the `flyte-backend` looks like this:
```yaml
plugins:
  k8s:
    default-env-vars:
    - AWS_METADATA_SERVICE_TIMEOUT: 5
    - AWS_METADATA_SERVICE_NUM_ATTEMPTS: 20
    default-tolerations:
    - effect: NoSchedule
      key: datology-job-type
      operator: Exists
    inject-finalizer: true
  spark:
    spark-config-default:
    - spark.eventLog.enabled: "true"
    - spark.eventLog.dir: s3a://dev-datologyai-job-logs/dev-next-spark-operator-logs
    - spark.eventLog.rolling.enabled: "true"
    - spark.eventLog.rolling.maxFileSize: 16m
    - spark.kubernetes.authenticate.submission.caCertFile: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    - spark.kubernetes.authenticate.submission.oauthTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
    - spark.hadoop.fs.s3a.aws.credentials.provider: com.amazonaws.auth.WebIdentityTokenCredentialsProvider
    - spark.hadoop.fs.s3a.impl: org.apache.hadoop.fs.s3a.S3AFileSystem
    - spark.driver.extraJavaOptions: -Divy.cache.dir=/tmp -Divy.home=/tmp
storage:
  cache:
    max_size_mbs: 100
    target_gc_percent: 100
```
The tolerations show up on all of my Flyte task pods except for the pods that get launched via the Spark operator.
f
Aah, the joys of the Spark operator
but this might be something to look into
average-finland-92144
@acoustic-nest-94594 I don't think `default-tolerations` would help, because it injects tolerations into Pods spawned by the propeller K8s plugin, and Spark is a different one. The only relevant option I see is whether you could use `plugins.spark.spark-config-default` to set the tolerations that the operator ends up applying to the driver/executor Pods. At the spark-operator Helm chart level I can only see tolerations for the controller itself. From the operator API docs, it doesn't seem that tolerations are even configurable for the driver/executor, but I may be wrong.
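For reference, the spark-operator `v1beta2` API does expose `tolerations` under `spec.driver` and `spec.executor`, but pod-level fields like these only take effect when the operator's mutating admission webhook is enabled. A minimal sketch of a SparkApplication using them (names here are illustrative, and the unrelated required fields are elided):

```yaml
# Sketch: tolerations on the driver and executor pods via the
# SparkApplication CRD (v1beta2). These pod-level fields are applied by
# the operator's mutating admission webhook and are silently ignored
# without it.
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: example-app   # illustrative name
spec:
  driver:
    tolerations:
    - key: datology-job-type
      operator: Exists
      effect: NoSchedule
  executor:
    tolerations:
    - key: datology-job-type
      operator: Exists
      effect: NoSchedule
```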
acoustic-nest-94594
Hey @average-finland-92144! I think my fallback plan is to use the `spark.kubernetes.{driver/executor}.podTemplateFile` property in those Spark configs to create a pod template that includes the tolerations. I was just surprised b/c it looked like the Flyte `spark.go` code was using those `k8s.default-tolerations` (and the other settings for the pods under `k8s`) to set up the default pod spec that gets passed to the `createSparkPodSpec` function from e.g. here: https://github.com/flyteorg/flyte/blob/master/flyteplugins/go/tasks/plugins/k8s/spark/spark.go#L177
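For concreteness, that fallback would look something like this. The `/etc/spark/pod-template.yaml` path is an assumption; the file has to be mounted wherever spark-submit runs, which isn't shown here:

```yaml
# Sketch: point Spark at a pod template that carries the tolerations.
# The /etc/spark/pod-template.yaml path is illustrative.
plugins:
  spark:
    spark-config-default:
    - spark.kubernetes.driver.podTemplateFile: /etc/spark/pod-template.yaml
    - spark.kubernetes.executor.podTemplateFile: /etc/spark/pod-template.yaml
```

with the template itself being a plain Pod manifest:

```yaml
# /etc/spark/pod-template.yaml (illustrative)
apiVersion: v1
kind: Pod
spec:
  tolerations:
  - key: datology-job-type
    operator: Exists
    effect: NoSchedule
```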
Yo, just wanted to report back that I finally got this to work by properly configuring the mutating webhook for the Spark operator: https://www.kubeflow.org/docs/components/spark-operator/getting-started/#about-the-mutating-admission-webhook -- you just need to make sure it's set up on your Spark operator, and that the namespaces/service accounts Flyte is using line up with the ones the mutating webhook is watching.
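For anyone who finds this later, enabling that looks roughly like this in the spark-operator Helm values. Key names vary across chart versions, so treat this as a sketch and check your chart's values.yaml:

```yaml
# Sketch of spark-operator Helm values. The important bits: the mutating
# webhook must be enabled, and the operator has to watch the namespaces
# Flyte submits SparkApplications into.
webhook:
  enable: true
```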
f
Ohh yes, but this is because the Spark driver kicks off pods, and the driver code is old and does not use pod specs, sadly
We should upstream this to Spark