# flyte-support
h
We've been seeing some mysterious non-interruptible node selector affinities since enabling the flyte spark plugin in one of our deployments. It seems like each time we launch a task that uses the spark plugin, an extra non-interruptible-node-selector-requirement shows up (for any future tasks!). Interruptible tasks are then stuck in Pending, presumably because the conflicting gke-spot requirements can never all be satisfied at once; the affinity ends up with, for example:
```yaml
- key: cloud.google.com/gke-spot
  operator: DoesNotExist
- key: cloud.google.com/gke-spot
  operator: DoesNotExist
- key: cloud.google.com/gke-spot
  operator: In
  values:
  - "true"
```
(this is with flyte-core 1.6.2)
t
f
I am not sure if that change is responsible, but this behavior is really problematic: it affects every workflow with interruptible tasks that runs after a workflow with a spark task has run, whether or not it has spark tasks of its own.
(and for every subsequent spark task that runs, another copy of the non-interruptible node selector affinity will be added for future interruptible tasks)
t
the changes shouldn't propagate to non-spark tasks. not sure why that's happening. @high-accountant-32689 any idea what might be causing this issue?
f
For more context: restarting propeller resolves the issue (no more spurious non-interruptible node selector affinities will be added to interruptible tasks), but only until the next time a spark task runs. Then, 1 copy of the affinity will be added to interruptible tasks; if another spark task runs, 2 copies are added, etc., until propeller is restarted again.
t
@average-finland-92144 / @glamorous-carpet-83516 any idea what might be causing this issue?
a
@full-ram-17934 / @hallowed-camera-82098 what version of the Spark operator are you using? can you share the spark-config-default block you used in the Helm chart?
h
spark operator is version 1.1.27
```yaml
spark-config-default:
  - spark.kubernetes.authenticate.driver.serviceAccountName: "flyte-worker"
  - spark.hadoop.fs.gs.project: "{{ .Values.userSettings.googleProjectId }}"
  - spark.hadoop.fs.gs.impl: "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem"
  - spark.hadoop.fs.AbstractFileSystem.gs.impl: "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS"
  - spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version: "2"
  - spark.kubernetes.allocation.batch.size: "50"
  - spark.excludeOnFailure.enabled: "true"
  - spark.excludeOnFailure.timeout: "5m"
  - spark.task.maxfailures: "4"
```
h
cc @flat-area-42876 mind taking a look here - it seems the spark plugin may be updating the interruptible configuration.
👍 1
@hallowed-camera-82098 would you mind creating a github ticket for this? Since this is a correctness issue, it's something we will investigate immediately.
:gratitude-thank-you: 1
h
f
Hi @hallowed-camera-82098 @full-ram-17934 👋 Thanks for pointing this out. I'm not able to reproduce this issue, and I haven't been able to identify the bug while stepping through the code. Would you be able to reproduce this on a demo cluster and share the configuration you set?
👍 1
h
Thanks for looking into this, Paul and others. I tried upgrading our Flyte this morning to 1.10.6 (from 1.6.2, except flytepropeller, which was pinned to 1.1.95) and that seems to have resolved the issue; probably should have been the first thing I tried.
f
Given the nature of the correctness issue, I wonder if it's worth backporting a fix? Or at least trying to add some kind of test that would catch a regression. Upgrading works for us though. Thanks for the investigation.
f
I just tried it out again and was able to repro the issue after resetting my local dev cluster. Apologies for the confusion - I think I had mistakenly pinned flyteplugins to 1.1.27 while running flyte on 1.6.2. I'll look into this more later this evening to find the exact cause. @hallowed-mouse-14616 is there a general procedure for addressing bugs on an old release that don't exist on the latest release? @full-ram-17934 we can look into setting up a test to catch this in the future.
hmm - unsure what I did with my local setup previously, but the pinning of 1.1.27 was not it. The issue was here:
```go
Affinity:         config.GetK8sPluginConfig().DefaultAffinity,
```
It should be:
```go
Affinity:         config.GetK8sPluginConfig().DefaultAffinity.DeepCopy(),
```
The pointer to the DefaultAffinity value was being set directly into the spark driverSpec (and from there into the resource spec), so updating the node selector requirements there mutated the shared DefaultAffinity until propeller restarted. The issue was handled with this PR.
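To make the failure mode concrete, here is a minimal sketch of the aliasing problem; it is not the actual plugin code, and the simplified Affinity shape and helper names (addSpotRequirement, requirementCount) are made up for illustration:
```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// Stand-in for config.GetK8sPluginConfig().DefaultAffinity: a single Affinity
// value that lives for the lifetime of the propeller process.
var defaultAffinity = &v1.Affinity{
	NodeAffinity: &v1.NodeAffinity{
		RequiredDuringSchedulingIgnoredDuringExecution: &v1.NodeSelector{
			NodeSelectorTerms: []v1.NodeSelectorTerm{{}},
		},
	},
}

// addSpotRequirement mimics the interruptible handling that appends a
// cloud.google.com/gke-spot requirement to whatever affinity it is given.
func addSpotRequirement(a *v1.Affinity) {
	term := &a.NodeAffinity.RequiredDuringSchedulingIgnoredDuringExecution.NodeSelectorTerms[0]
	term.MatchExpressions = append(term.MatchExpressions, v1.NodeSelectorRequirement{
		Key:      "cloud.google.com/gke-spot",
		Operator: v1.NodeSelectorOpDoesNotExist,
	})
}

// requirementCount reports how many requirements the shared default carries.
func requirementCount() int {
	return len(defaultAffinity.NodeAffinity.RequiredDuringSchedulingIgnoredDuringExecution.NodeSelectorTerms[0].MatchExpressions)
}

func main() {
	// Buggy pattern: the pod spec aliases the shared pointer, so mutating the
	// pod's affinity also mutates the process-wide default.
	buggyPod := &v1.PodSpec{Affinity: defaultAffinity}
	addSpotRequirement(buggyPod.Affinity)
	fmt.Println("shared default after buggy task:", requirementCount()) // 1

	// Fixed pattern: DeepCopy gives the pod its own Affinity, so the shared
	// default is untouched by the same mutation.
	fixedPod := &v1.PodSpec{Affinity: defaultAffinity.DeepCopy()}
	addSpotRequirement(fixedPod.Affinity)
	fmt.Println("shared default after fixed task:", requirementCount()) // still 1
}
```
Because propeller is a long-running process, the mutated default persists across task launches, which matches the observed behavior of one extra requirement per spark task until propeller is restarted.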
🎉 4
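On the regression-test idea raised above, a sketch of what such a test could assert; buildDriverPodSpec here is a hypothetical stand-in for however the plugin builds the driver pod spec, not an actual flyteplugins function:
```go
package sparkplugin_test

import (
	"reflect"
	"testing"

	v1 "k8s.io/api/core/v1"
)

// buildDriverPodSpec is a hypothetical stand-in for the plugin code under test;
// with the fix in place it must hand each pod its own copy of the default affinity.
func buildDriverPodSpec(defaultAffinity *v1.Affinity) *v1.PodSpec {
	return &v1.PodSpec{Affinity: defaultAffinity.DeepCopy()}
}

func TestDefaultAffinityIsNotMutated(t *testing.T) {
	defaultAffinity := &v1.Affinity{
		NodeAffinity: &v1.NodeAffinity{
			RequiredDuringSchedulingIgnoredDuringExecution: &v1.NodeSelector{
				NodeSelectorTerms: []v1.NodeSelectorTerm{{}},
			},
		},
	}
	want := defaultAffinity.DeepCopy()

	// Build a pod spec and mutate its affinity the way interruptible handling does.
	pod := buildDriverPodSpec(defaultAffinity)
	term := &pod.Affinity.NodeAffinity.RequiredDuringSchedulingIgnoredDuringExecution.NodeSelectorTerms[0]
	term.MatchExpressions = append(term.MatchExpressions, v1.NodeSelectorRequirement{
		Key:      "cloud.google.com/gke-spot",
		Operator: v1.NodeSelectorOpDoesNotExist,
	})

	// The shared default must be unchanged after building and mutating a pod spec.
	if !reflect.DeepEqual(defaultAffinity, want) {
		t.Fatalf("shared DefaultAffinity was mutated: %+v", defaultAffinity)
	}
}
```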
h
Right now we don't offer LTS support for any Flyte releases, so updating is the only way to fix some of these bugs. There have been conversations about revisiting this, maybe @high-accountant-32689 can elaborate?!?