
elegant-parrot-47406

02/06/2025, 6:17 PM

Hey everyone, we want to use GPUs with Flyte and followed this tutorial. We have also created a node pool with GPUs in our K8s cluster. We use `flyte-binary` and specify the following configuration:


configuration:
  inline:
    plugins:
      k8s:
        resource-tolerations:
          - nvidia.com/gpu:
              - key: "mykey"
                operator: "Equal"
                value: "myvalue"
                effect: "NoSchedule"

Additionally, we specify the `nodeSelector`:


configuration:
  inline:
    plugins:
      k8s:
        gpu-device-node-label: "cloud.google.com/gke-accelerator"

Moreover, we set `default-node-selector` to another node pool (spot) to run other CPU-focused workloads. The issue we're facing is that the pod requesting a GPU seems to get matched with the correct node via `gpu-device-node-label`, but it results in an error stating that the label of `gpu-device-node-label` does not match `default-node-selector`. Additionally, when inspecting the pod requesting a GPU with `kubectl`, I notice that the `Node-Selectors` field still includes `default-node-selector`. Can someone help me with this?

average-finland-92144

02/06/2025, 8:28 PM

Could you share the `kubectl describe` of a task Pod? As a side note, I think `gpu-device-node-label` should be called `gpu-device-node-selector` to reduce confusion.

So @elegant-parrot-47406, is the expected behavior that if a task requests a GPU, ONLY the GPU node selector is injected, and tasks that don't request GPUs ONLY get the default node selector? Do you also get errors in tasks with no GPUs?

elegant-parrot-47406

02/07/2025, 8:29 AM

Hey @average-finland-92144, thanks for answering so quickly!
• According to the documentation, it's `gpu-device-node-label`.
• The behavior of pods not requesting GPUs is OK. It works as expected and they get the default node selector, so all good here.
• The expected behaviour for pods requesting GPUs should be: only the GPU node selector is injected.

elegant-parrot-47406

02/07/2025, 8:31 AM

For a pod requesting a GPU, `kubectl describe` would show something like:


nodeSelector:
    default-node-selector
  .....
  tolerations:
  - effect: ....

average-finland-92144

02/10/2025, 6:58 PM

"According to the documentation it's `gpu-device-node-label`"

Oh sure, it's more like me complaining that Flyte should name it for what it really is: a selector. The label is on the nodes. But maybe that's just a semantics issue. The behavior of `default-node-selector` being injected into all Pods is expected.

average-finland-92144

02/10/2025, 7:00 PM

The "Flyte way" of handling spot instances is to set the tasks that need to use them to `interruptible=True` in the task decorator. To better control scheduling, you can set `interruptible-node-selector` to match the labels and conditions that your spot instances have configured

average-finland-92144

02/10/2025, 7:01 PM

then Flyte will handle either spot or GPU requests properly

average-finland-92144

02/10/2025, 7:01 PM

https://docs.flyte.org/en/latest/user_guide/productionizing/spot_instances.html#setting-up-spot-instances
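A rough sketch of what that suggestion could look like in the `flyte-binary` values, in the same layout as the snippets earlier in the thread. This is not a confirmed fix. The label `cloud.google.com/gke-spot: "true"` is an assumption (it is the label GKE puts on Spot nodes), so swap in whatever labels your spot pool actually carries. The idea is to move the spot label out of `default-node-selector`, which gets applied to every Pod, and into `interruptible-node-selector`, which should only apply to tasks marked `interruptible=True`:

```yaml
configuration:
  inline:
    plugins:
      k8s:
        # Only tasks marked interruptible=True should receive this selector.
        # Assumed label: GKE sets cloud.google.com/gke-spot=true on Spot nodes;
        # replace it with the labels your spot node pool really has.
        interruptible-node-selector:
          cloud.google.com/gke-spot: "true"
        # If the spot pool is tainted, the matching tolerations would go
        # under interruptible-tolerations (not shown, depends on your taints).
        # The GPU settings from earlier in the thread stay as they are:
        gpu-device-node-label: "cloud.google.com/gke-accelerator"
```

With something like this in place, GPU tasks left as non-interruptible would no longer inherit the spot selector, while CPU tasks would need `interruptible=True` to land on the spot pool.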
