# ask-the-community
o
We've been seeing some mysterious non-interruptible node selector affinities since enabling the flyte spark plugin in one of our deployments. It seems like each time we launch a task that uses the spark plugin, an extra `non-interruptible-node-selector-requirement` shows up (on all future tasks!). Interruptible tasks then get stuck pending with affinities like the following, which no node can satisfy because the `DoesNotExist` and `In` requirements contradict each other:
```yaml
- key: cloud.google.com/gke-spot
  operator: DoesNotExist
- key: cloud.google.com/gke-spot
  operator: DoesNotExist
- key: cloud.google.com/gke-spot
  operator: In
  values:
  - "true"
```
(this is with flyte-core 1.6.2)
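For reference, the repeated `DoesNotExist` entries have the shape of a non-interruptible node selector requirement, while the `In "true"` entry is the interruptible one (both keyed on `cloud.google.com/gke-spot` for GKE spot nodes). The standalone Go sketch below uses illustrative names rather than Flyte code to show why a term that accumulates both kinds of requirement can never be scheduled:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func main() {
	// Non-interruptible pods are kept off spot nodes: the label must be absent.
	nonInterruptible := corev1.NodeSelectorRequirement{
		Key:      "cloud.google.com/gke-spot",
		Operator: corev1.NodeSelectorOpDoesNotExist,
	}
	// Interruptible pods are pinned to spot nodes: the label must be "true".
	interruptible := corev1.NodeSelectorRequirement{
		Key:      "cloud.google.com/gke-spot",
		Operator: corev1.NodeSelectorOpIn,
		Values:   []string{"true"},
	}

	// Requirements inside one matchExpressions list are ANDed, so a term that has
	// picked up both kinds of requirement matches no node and the pod stays Pending.
	term := corev1.NodeSelectorTerm{
		MatchExpressions: []corev1.NodeSelectorRequirement{nonInterruptible, interruptible},
	}
	fmt.Printf("unsatisfiable term: %+v\n", term)
}
```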
m
I am not sure if that change is responsible, but this behavior is really problematic because it affects every workflow with interruptible tasks that runs after a workflow with a spark task, whether or not those workflows contain spark tasks themselves.
(and for every subsequent spark task that runs, another copy of the non-interruptible node selector affinity is added to future interruptible tasks)
s
the changes shouldn't propagate to non-spark tasks. not sure why that's happening. @Eduardo Apolinario (eapolinario) any idea what might be causing this issue?
m
For more context: restarting propeller resolves the issue (no more spurious non-interruptible node selector affinities will be added to interruptible tasks), but only until the next time a spark task runs. Then, 1 copy of the affinity will be added to interruptible tasks; if another spark task runs, 2 copies are added, etc., until propeller is restarted again.
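A minimal standalone Go sketch of how that restart-bounded accumulation could happen if each spark pod spec aliased a single shared default affinity instead of copying it. The names (`defaultAffinity`, `buildSparkDriverPodSpec`) are illustrative, not the actual flyteplugins code; only the `cloud.google.com/gke-spot` requirement mirrors the snippet above:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// defaultAffinity stands in for the default affinity held in plugin config for the
// lifetime of the propeller process (illustrative only, not the real config object).
var defaultAffinity = &corev1.Affinity{
	NodeAffinity: &corev1.NodeAffinity{
		RequiredDuringSchedulingIgnoredDuringExecution: &corev1.NodeSelector{
			NodeSelectorTerms: []corev1.NodeSelectorTerm{{}},
		},
	},
}

// buildSparkDriverPodSpec stands in for a code path that assembles a spark driver
// pod spec and then applies the non-interruptible node selector requirement.
// Because it aliases defaultAffinity instead of copying it, the append below
// mutates the shared config in place.
func buildSparkDriverPodSpec() *corev1.PodSpec {
	spec := &corev1.PodSpec{Affinity: defaultAffinity} // no DeepCopy: aliases shared state
	terms := spec.Affinity.NodeAffinity.RequiredDuringSchedulingIgnoredDuringExecution.NodeSelectorTerms
	terms[0].MatchExpressions = append(terms[0].MatchExpressions, corev1.NodeSelectorRequirement{
		Key:      "cloud.google.com/gke-spot",
		Operator: corev1.NodeSelectorOpDoesNotExist,
	})
	return spec
}

func main() {
	// Every "spark task" leaves one more requirement behind in the shared default,
	// which then shows up on every later pod built from that same default.
	for i := 1; i <= 3; i++ {
		buildSparkDriverPodSpec()
		shared := defaultAffinity.NodeAffinity.RequiredDuringSchedulingIgnoredDuringExecution.NodeSelectorTerms[0]
		fmt.Printf("after spark task %d: %d stale requirement(s) in the shared default\n",
			i, len(shared.MatchExpressions))
	}
}
```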
s
@David Espejo (he/him) / @Kevin Su any idea what might be causing this issue?
d
@Matthew Corley / @Olivier Sevin what version of the Spark operator are you using? Can you share the `spark-config-default` block you used in the Helm chart?
o
Spark operator is version 1.1.27
```yaml
spark-config-default:
  - spark.kubernetes.authenticate.driver.serviceAccountName: "flyte-worker"
  - spark.hadoop.fs.gs.project: "{{ .Values.userSettings.googleProjectId }}"
  - spark.hadoop.fs.gs.impl: "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem"
  - spark.hadoop.fs.AbstractFileSystem.gs.impl: "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS"
  - spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version: "2"
  - spark.kubernetes.allocation.batch.size: "50"
  - spark.excludeOnFailure.enabled: "true"
  - spark.excludeOnFailure.timeout: "5m"
  - spark.task.maxfailures: "4"
```
d
cc @Paul Dittamo mind taking a look here? It seems the spark plugin may be updating the `interruptible` configuration.
@Olivier Sevin would you mind creating a GitHub ticket for this? Since this is a correctness issue, it's something we will investigate immediately.
p
Hi @Olivier Sevin @Matthew Corley 👋 Thanks for pointing this out. I'm not able to reproduce this issue and haven't been able to identify the bug while stepping through the code. Would you be able to reproduce this on a demo cluster and share the configuration you set?
o
Thanks for looking into this, Paul and others. I tried upgrading our Flyte deployment this morning to 1.10.6 (from 1.6.2, except flytepropeller, which was pinned to 1.1.95), and that seems to have resolved the issue; that probably should have been the first thing I tried.
m
Given the nature of the correctness issue, I wonder if it's worth backporting a fix? Or at least trying to add some kind of test that would catch a regression. Upgrading works for us though. Thanks for the investigation.
p
I just tried it out again and was able to repro the issue after resetting my local dev cluster. Apologies for the confusion - I think I had mistakenly pinned flyteplugins to 1.1.27 while running flyte on 1.6.2. I'll look into this more later this evening to find the exact cause. @Dan Rammer (hamersaw) is there a general procedure for updating bugs in an old release that don't exist in the latest release? @Matthew Corley we can look into setting up a test to catch this in the future.
hmm - unsure what I did with my local setup previously, but the pinning to 1.1.27 was not it. The issue was here:
```go
Affinity:         config.GetK8sPluginConfig().DefaultAffinity,
```
It should be:
```go
Affinity:         config.GetK8sPluginConfig().DefaultAffinity.DeepCopy(),
```
The pointer to the `DefaultAffinity` value was placed directly in the spark driverSpec's resource spec, so when the node selector requirements were later updated on that spec, they altered the configured `DefaultAffinity` itself until propeller restarts. The issue was handled with this PR.
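Continuing the illustrative sketch from earlier in the thread (again, hypothetical names rather than the actual flyteplugins code): the one-line `DeepCopy()` change gives every pod spec its own affinity, and a small check in the spirit of the regression test Matthew suggested confirms the shared default is no longer mutated:

```go
package main

import (
	"fmt"
	"reflect"

	corev1 "k8s.io/api/core/v1"
)

// Same illustrative shared default as in the earlier sketch.
var defaultAffinity = &corev1.Affinity{
	NodeAffinity: &corev1.NodeAffinity{
		RequiredDuringSchedulingIgnoredDuringExecution: &corev1.NodeSelector{
			NodeSelectorTerms: []corev1.NodeSelectorTerm{{}},
		},
	},
}

func buildSparkDriverPodSpec() *corev1.PodSpec {
	// The fix: each pod spec gets its own copy of the configured default affinity,
	// so later edits to the node selector requirements cannot leak back into it.
	spec := &corev1.PodSpec{Affinity: defaultAffinity.DeepCopy()}
	terms := spec.Affinity.NodeAffinity.RequiredDuringSchedulingIgnoredDuringExecution.NodeSelectorTerms
	terms[0].MatchExpressions = append(terms[0].MatchExpressions, corev1.NodeSelectorRequirement{
		Key:      "cloud.google.com/gke-spot",
		Operator: corev1.NodeSelectorOpDoesNotExist,
	})
	return spec
}

func main() {
	// Regression-style check: building pod specs must not mutate the shared default.
	before := defaultAffinity.DeepCopy()
	buildSparkDriverPodSpec()
	buildSparkDriverPodSpec()
	fmt.Println("default affinity unchanged:", reflect.DeepEqual(before, defaultAffinity))
}
```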
d
Right now we don't offer LTS support for any Flyte releases, so updating is the only way to fix some of these bugs. There have been conversations about revisiting this, maybe @Eduardo Apolinario (eapolinario) can elaborate?!?