#ask-the-community

Olivier Sevin

12/13/2023, 9:08 PM
We've been seeing some mysterious non-interruptible node selector affinities since enabling the Flyte Spark plugin in one of our deployments. It seems that each time we launch a task that uses the Spark plugin, an extra non-interruptible-node-selector-requirement shows up (for any future tasks!). Interruptible tasks are then stuck pending with, for example:
- key: cloud.google.com/gke-spot
  operator: DoesNotExist
- key: cloud.google.com/gke-spot
  operator: DoesNotExist
- key: cloud.google.com/gke-spot
  operator: In
  values:
  - "true"
(this is with flyte-core 1.6.2)
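A minimal sketch of why a pod carrying the expressions above would stay pending, assuming all three sit in the same node selector term (Kubernetes ANDs the match expressions within one term). The `Requirement` type and the `matches`, `termMatches`, and `conflictingTerm` helpers are hypothetical stand-ins written for illustration, not the Kubernetes scheduler code:

```go
package main

import "fmt"

// Requirement is a hypothetical, simplified stand-in for a Kubernetes
// node selector requirement (key, operator, values).
type Requirement struct {
	Key      string
	Operator string
	Values   []string
}

// matches evaluates one requirement against a node's labels.
// Only the two operators that appear in the thread are handled.
func matches(labels map[string]string, r Requirement) bool {
	v, ok := labels[r.Key]
	switch r.Operator {
	case "DoesNotExist":
		return !ok
	case "In":
		if !ok {
			return false
		}
		for _, want := range r.Values {
			if v == want {
				return true
			}
		}
		return false
	}
	return false
}

// termMatches ANDs every requirement in a single term.
func termMatches(labels map[string]string, reqs []Requirement) bool {
	for _, r := range reqs {
		if !matches(labels, r) {
			return false
		}
	}
	return true
}

// conflictingTerm reproduces the expressions pasted in the thread.
func conflictingTerm() []Requirement {
	const key = "cloud.google.com/gke-spot"
	return []Requirement{
		{Key: key, Operator: "DoesNotExist"},
		{Key: key, Operator: "DoesNotExist"},
		{Key: key, Operator: "In", Values: []string{"true"}},
	}
}

func main() {
	spot := map[string]string{"cloud.google.com/gke-spot": "true"}
	onDemand := map[string]string{}
	fmt.Println("spot node matches:", termMatches(spot, conflictingTerm()))
	fmt.Println("on-demand node matches:", termMatches(onDemand, conflictingTerm()))
}
```

Under that assumption neither kind of node can satisfy the term: a spot node fails `DoesNotExist`, and a node without the label fails `In`.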

Samhita Alla

12/14/2023, 1:13 PM

Matthew Corley

12/14/2023, 6:48 PM
I am not sure if that change is responsible, but this behavior is really problematic because it affects all workflows with interruptible tasks that are run after a workflow with a spark task is run, whether or not they have spark tasks.
(and for every subsequent spark task that runs, another copy of the non-interruptible node selector affinity will be added for future interruptible tasks)

Samhita Alla

12/15/2023, 5:33 AM
the changes shouldn't propagate to non-spark tasks. not sure why that's happening. @Eduardo Apolinario (eapolinario) any idea what might be causing this issue?

Matthew Corley

12/15/2023, 5:50 PM
For more context: restarting propeller resolves the issue (no more spurious non-interruptible node selector affinities will be added to interruptible tasks), but only until the next time a spark task runs. Then, 1 copy of the affinity will be added to interruptible tasks; if another spark task runs, 2 copies are added, etc., until propeller is restarted again.

Samhita Alla

12/18/2023, 5:08 AM
@David Espejo (he/him) / @Kevin Su any idea what might be causing this issue?

David Espejo (he/him)

12/18/2023, 12:32 PM
@Matthew Corley / @Olivier Sevin what version of the Spark operator are you using? Can you share the spark-config-default block you used in the Helm chart?

Olivier Sevin

12/18/2023, 2:24 PM
spark operator is version 1.1.27
spark-config-default:
  - spark.kubernetes.authenticate.driver.serviceAccountName: "flyte-worker"
  - spark.hadoop.fs.gs.project: "{{ .Values.userSettings.googleProjectId }}"
  - spark.hadoop.fs.gs.impl: "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem"
  - spark.hadoop.fs.AbstractFileSystem.gs.impl: "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS"
  - spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version: "2"
  - spark.kubernetes.allocation.batch.size: "50"
  - spark.excludeOnFailure.enabled: "true"
  - spark.excludeOnFailure.timeout: "5m"
  - spark.task.maxfailures: "4"

Dan Rammer (hamersaw)

12/18/2023, 2:33 PM
cc @Paul Dittamo mind taking a look here - it seems the spark plugin may be updating the interruptible configuration.
@Olivier Sevin would you mind creating a github ticket for this? As an issue with correctness, this is something we will investigate immediately.

Olivier Sevin

12/18/2023, 2:42 PM

Paul Dittamo

12/20/2023, 9:13 PM
Hi @Olivier Sevin @Matthew Corley 👋 Thanks for pointing this out. I'm not able to reproduce this issue and am not able to identify the bug while stepping through the code. Would you be able to reproduce this on a demo cluster and share the set configurations?

Olivier Sevin

12/21/2023, 6:11 PM
Thanks for looking into this, Paul and others. I tried upgrading our Flyte this morning to 1.10.6 (from 1.6.2, except flytepropeller pinned to 1.1.95) and that seems to have resolved the issue. Probably should have been the first thing I tried.

Matthew Corley

12/21/2023, 7:48 PM
Given the nature of the correctness issue, I wonder if it's worth backporting a fix? Or at least trying to add some kind of test that would catch a regression. Upgrading works for us though. Thanks for the investigation.

Paul Dittamo

12/21/2023, 9:53 PM
I just tried it out again and was able to repro the issue after resetting my local dev cluster. Apologies for the confusion - I think I had mistakenly pinned flyteplugins to 1.1.27 while running flyte on 1.6.2. I'll look into this more later this evening to find the exact cause. @Dan Rammer (hamersaw) is there a general procedure for fixing bugs on an old release that don't exist on the latest release? @Matthew Corley we can look into setting up a test to catch this in the future.
hmm - unsure what I did with my local setup previously, but the pinning of 1.1.27 was not it. The issue was here:
Affinity:         config.GetK8sPluginConfig().DefaultAffinity,
Should be
Affinity:         config.GetK8sPluginConfig().DefaultAffinity.DeepCopy(),
The pointer to the DefaultAffinity value would get set in the spark driverSpec -> resource Spec, and updating the node selector requirements there would alter the shared DefaultAffinity until propeller restarts. The issue was handled with this PR.

Dan Rammer (hamersaw)

12/22/2023, 2:34 PM
Right now we don't offer LTS support for any Flyte releases, so updating is the only way to fix some of these bugs. There have been conversations about revisiting this, maybe @Eduardo Apolinario (eapolinario) can elaborate?!?