We ve been seeing some mysterious non interruptible node sel Flyte #flyte-support


hallowed-camera-82098

12/13/2023, 9:08 PM

We've been seeing some mysterious non-interruptible node selector affinities since enabling the flyte spark plugin in one of our deployments. It seems like each time we launch a task that uses the spark plugin, it causes an extra non-interruptible node selector requirement to show up (for any future tasks!). Interruptible tasks are then stuck pending with, for example:


- key: cloud.google.com/gke-spot
  operator: DoesNotExist
- key: cloud.google.com/gke-spot
  operator: DoesNotExist
- key: cloud.google.com/gke-spot
  operator: In
  values:
  - "true"
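
To illustrate why a pod carrying these requirements can never be scheduled: Kubernetes ANDs all matchExpressions within a single node selector term, so `DoesNotExist` and `In ["true"]` on the same key cannot both hold. The following is a minimal sketch of that evaluation; the `Requirement` type and the `matches`/`matchesAll` helpers are hypothetical simplifications, not the Kubernetes or Flyte implementation.

```go
package main

import "fmt"

// Requirement is a simplified, hypothetical stand-in for a Kubernetes
// node selector requirement (key, operator, values).
type Requirement struct {
	Key      string
	Operator string // only "In" and "DoesNotExist" are handled in this sketch
	Values   []string
}

// matches reports whether a single requirement holds for the node labels.
func (r Requirement) matches(labels map[string]string) bool {
	v, ok := labels[r.Key]
	switch r.Operator {
	case "DoesNotExist":
		return !ok
	case "In":
		if !ok {
			return false
		}
		for _, want := range r.Values {
			if v == want {
				return true
			}
		}
		return false
	default:
		return false
	}
}

// matchesAll ANDs every requirement, as Kubernetes does for the
// matchExpressions of a single node selector term.
func matchesAll(reqs []Requirement, labels map[string]string) bool {
	for _, r := range reqs {
		if !r.matches(labels) {
			return false
		}
	}
	return true
}

func main() {
	// The requirements from the stuck pod above.
	reqs := []Requirement{
		{Key: "cloud.google.com/gke-spot", Operator: "DoesNotExist"},
		{Key: "cloud.google.com/gke-spot", Operator: "DoesNotExist"},
		{Key: "cloud.google.com/gke-spot", Operator: "In", Values: []string{"true"}},
	}
	spot := map[string]string{"cloud.google.com/gke-spot": "true"}
	onDemand := map[string]string{}

	fmt.Println("spot node matches:", matchesAll(reqs, spot))         // false
	fmt.Println("on-demand node matches:", matchesAll(reqs, onDemand)) // false
}
```

Neither a spot node nor an on-demand node satisfies the combined requirements, which is consistent with the tasks staying pending.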

hallowed-camera-82098

12/14/2023, 1:08 AM

(this is with flyte-core 1.6.2)

tall-lock-23197

12/14/2023, 1:13 PM

could be because of https://github.com/flyteorg/flyteplugins/pull/346 PR.

full-ram-17934

12/14/2023, 6:48 PM

I am not sure if that change is responsible, but this behavior is really problematic because it affects all workflows with interruptible tasks that are run after a workflow with a spark task is run, whether or not they have spark tasks.

full-ram-17934

12/14/2023, 6:49 PM

(and for every subsequent spark task that runs, another copy of the non-interruptible node selector affinity will be added for future interruptible tasks)

tall-lock-23197

12/15/2023, 5:33 AM

the changes shouldn't propagate to non-spark tasks. not sure why that's happening. @high-accountant-32689 any idea what might be causing this issue?

full-ram-17934

12/15/2023, 5:50 PM

For more context: restarting propeller resolves the issue (no more spurious non-interruptible node selector affinities will be added to interruptible tasks), but only until the next time a spark task runs. Then, 1 copy of the affinity will be added to interruptible tasks; if another spark task runs, 2 copies are added, etc., until propeller is restarted again.

tall-lock-23197

12/18/2023, 5:08 AM

@average-finland-92144 / @glamorous-carpet-83516 any idea what might be causing this issue?

average-finland-92144

12/18/2023, 12:32 PM

@full-ram-17934 / @hallowed-camera-82098 what version of the Spark operator are you using? can you share the spark-config-default block you used in the Helm chart?

hallowed-camera-82098

12/18/2023, 2:24 PM

spark operator is version 1.1.27


spark-config-default:
  - spark.kubernetes.authenticate.driver.serviceAccountName: "flyte-worker"
  - spark.hadoop.fs.gs.project:
      "{{ .Values.userSettings.googleProjectId }}"
  - spark.hadoop.fs.gs.impl: "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem"
  - spark.hadoop.fs.AbstractFileSystem.gs.impl: "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS"
  - spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version: "2"
  - spark.kubernetes.allocation.batch.size: "50"
  - spark.excludeOnFailure.enabled: "true"
  - spark.excludeOnFailure.timeout: "5m"
  - spark.task.maxfailures: "4"
hallowed-mouse-14616

12/18/2023, 2:33 PM

cc @flat-area-42876 mind taking a look here - it seems the spark plugin may be updating the interruptible configuration.

👍 1

hallowed-mouse-14616

12/18/2023, 2:36 PM

@hallowed-camera-82098 would you mind creating a github ticket for this? As an issue with correctness, this is something we will investigate immediately.

gratitude thank you 1

hallowed-camera-82098

12/18/2023, 2:42 PM

https://github.com/flyteorg/flyte/issues/4609 Thanks!

🙌 3

flat-area-42876

12/20/2023, 9:13 PM

Hi @hallowed-camera-82098 @full-ram-17934 👋 Thanks for pointing this out. I'm not able to reproduce this issue and am not able to identify the bug while stepping through the code. Would you be able to reproduce this on a demo cluster and share the set configurations?

👍 1

hallowed-camera-82098

12/21/2023, 6:11 PM

Thanks for looking into this, Paul and others. I tried upgrading our Flyte this morning to 1.10.6 (from 1.6.2, except flytepropeller pinned to 1.1.95) and that seems to have resolved the issue. Probably should have been the first thing I tried.

full-ram-17934

12/21/2023, 7:48 PM

Given the nature of the correctness issue, I wonder if it's worth backporting a fix? Or at least trying to add some kind of test that would catch a regression. Upgrading works for us though. Thanks for the investigation.

flat-area-42876

12/21/2023, 9:53 PM

I just tried it out again and was able to repro the issue after resetting my local dev cluster. Apologies for the confusion - I think I had mistakenly pinned flyteplugins to 1.1.27 while running flyte on 1.6.2. I'll look into this more later this evening to find the exact cause. @hallowed-mouse-14616 is there a general procedure for fixing bugs on an old release that don't exist on the latest release? @full-ram-17934 we can look into setting up a test to catch this in the future.

flat-area-42876

12/21/2023, 10:23 PM

hmm - unsure what I did with my local setup previously, but the pinning of 1.1.27 was not it. The issue was here:

Affinity:         config.GetK8sPluginConfig().DefaultAffinity,

Should be

Affinity:         config.GetK8sPluginConfig().DefaultAffinity.DeepCopy(),

The pointer to the DefaultAffinity value would get set in the spark driverSpec -> resource Spec -> update node selector requirements, altering the shared DefaultAffinity until propeller restarts. The issue was handled with this PR.
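
The aliasing described here can be sketched in isolation. This is a minimal illustration, assuming a simplified `Affinity` type and hypothetical helpers (`addNonInterruptible`, `runSparkTasks`); it is not the flyteplugins code, only the same pattern of a shared config pointer being mutated in place versus a deep copy.

```go
package main

import "fmt"

// Affinity is a simplified, hypothetical stand-in for the Kubernetes
// affinity type; it only carries a list of requirement strings.
type Affinity struct {
	Requirements []string
}

// DeepCopy returns an independent copy, similar in spirit to the
// generated DeepCopy methods on Kubernetes API types.
func (a *Affinity) DeepCopy() *Affinity {
	if a == nil {
		return nil
	}
	out := &Affinity{Requirements: make([]string, len(a.Requirements))}
	copy(out.Requirements, a.Requirements)
	return out
}

// addNonInterruptible mimics a plugin step that appends a node selector
// requirement to whatever affinity object it was handed.
func addNonInterruptible(a *Affinity) {
	a.Requirements = append(a.Requirements, "cloud.google.com/gke-spot DoesNotExist")
}

// runSparkTasks simulates n spark tasks against one long-lived default
// affinity and returns how many requirements leaked into that default.
func runSparkTasks(n int, deepCopy bool) int {
	defaultAffinity := &Affinity{} // lives as long as the process does
	for i := 0; i < n; i++ {
		podAffinity := defaultAffinity // bug: the spec aliases the default
		if deepCopy {
			podAffinity = defaultAffinity.DeepCopy() // fix: per-task copy
		}
		addNonInterruptible(podAffinity)
	}
	return len(defaultAffinity.Requirements)
}

func main() {
	fmt.Println("shared pointer, leaked requirements:", runSparkTasks(2, false)) // 2
	fmt.Println("deep copy, leaked requirements:", runSparkTasks(2, true))       // 0
}
```

With the shared pointer, each simulated spark task leaves one more requirement in the default, matching the "1 copy, then 2 copies" behavior reported earlier in the thread; with the deep copy the default stays untouched.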

🎉 4

hallowed-mouse-14616

12/22/2023, 2:34 PM

Right now we don't offer LTS support for any Flyte releases, so updating is the only way to fix some of these bugs. There have been conversations about revisiting this, maybe @high-accountant-32689 can elaborate?!?
