e
Hey everyone, we want to use GPUs with Flyte and followed this tutorial. We have also created a node pool with GPUs in our K8s cluster. We use `flyte-binary` and specify the following configuration:
configuration:
  inline:
    plugins:
      k8s:
        resource-tolerations:
          - nvidia.com/gpu:
              - key: "mykey"
                operator: "Equal"
                value: "myvalue"
                effect: "NoSchedule"
Additionally, we specify the `nodeSelector`:
configuration:
  inline:
    plugins:
      k8s:
        gpu-device-node-label: "cloud.google.com/gke-accelerator"
Moreover, we set `default-node-selector` to another node pool (spot) to run other CPU-focused workloads. The issue we're facing is that the pod requesting a GPU seems to get matched with the correct node via `gpu-device-node-label`, but scheduling then fails with an error stating that the label from `gpu-device-node-label` does not match `default-node-selector`. Additionally, when inspecting the pod requesting a GPU with `kubectl`, I notice that the `Node-Selectors` field still includes the `default-node-selector`. Can someone help me with this?
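For anyone reading along, a minimal Python sketch of why such a pod cannot be scheduled: Kubernetes only places a pod on a node that satisfies every required node constraint at once, so a pod carrying both the spot selector and the GPU label requirement needs a node that has both labels. The label values below (`nvidia-tesla-t4`, `cloud.google.com/gke-spot: "true"`) are assumptions for illustration and are not taken from the thread. Both constraints are modeled as plain selector entries, which is a simplification.

```python
def node_matches(node_labels: dict, node_selector: dict) -> bool:
    """Return True only if the node carries every key/value pair in the selector."""
    return all(node_labels.get(key) == value for key, value in node_selector.items())


# Hypothetical node pools: one with GPUs, one with spot instances.
gpu_node = {"cloud.google.com/gke-accelerator": "nvidia-tesla-t4"}
spot_node = {"cloud.google.com/gke-spot": "true"}

# A GPU pod that still carries the default (spot) selector needs BOTH labels.
gpu_pod_selector = {
    "cloud.google.com/gke-spot": "true",                     # from default-node-selector (assumed value)
    "cloud.google.com/gke-accelerator": "nvidia-tesla-t4",   # GPU requirement (assumed value)
}

print(node_matches(gpu_node, gpu_pod_selector))   # the GPU node lacks the spot label
print(node_matches(spot_node, gpu_pod_selector))  # the spot node lacks the GPU label
```

Neither pool satisfies the combined selector, which is consistent with the mismatch error described above.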
a
Could you share the `kubectl describe` output of a task Pod? As a side note, I think `gpu-device-node-label` should be called `gpu-device-node-selector` to reduce confusion. So @elegant-parrot-47406, is the expected behavior that if a task requests a GPU, ONLY the GPU node selector is injected, and tasks that don't request GPUs ONLY get the default node selector? Do you also get errors in tasks with no GPUs?
e
Hey @average-finland-92144, thanks for answering so quickly!
• According to the documentation it's `gpu-device-node-label`.
• The behavior of pods not requesting GPUs is OK. It works as expected and they get the default node selector, so all good here.
• The expected behaviour for pods requesting GPUs should be: only the GPU node selector is injected.
For a pod requesting a GPU, `kubectl describe` would show something like:
nodeSelector:
    default-node-selector
  .....
  tolerations:
  - effect: ....
a
• According to the documentation it's `gpu-device-node-label`
Oh sure, it's more like me complaining that Flyte should name it for what it really is: a selector. The label is on the nodes, but maybe that's just a semantics issue. The behavior of the `default-node-selector` being injected into all Pods is expected.
The "Flyte way" of handling spot instances is to set `interruptible=True` in the task decorator of the tasks that should use them. To better control scheduling, you can set `interruptible-node-selector` to match the labels and conditions configured on your spot instances. Then Flyte will handle both spot and GPU requests properly.
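A rough Python model of that suggestion, based only on the behavior described in this thread and not on Flyte's actual implementation: the default selector goes on every pod, while the interruptible selector is added only for tasks marked `interruptible=True`. The function name `build_node_selector` and the spot label `cloud.google.com/gke-spot: "true"` are assumptions for illustration.

```python
def build_node_selector(default_selector: dict,
                        interruptible_selector: dict,
                        interruptible: bool) -> dict:
    """Simplified model: default selector for every pod, plus the
    interruptible selector only when the task is marked interruptible."""
    selector = dict(default_selector)
    if interruptible:
        selector.update(interruptible_selector)
    return selector


SPOT = {"cloud.google.com/gke-spot": "true"}  # assumed spot node label

# Current setup: the spot label lives in default-node-selector,
# so even a non-interruptible GPU task is pinned to spot nodes.
gpu_pod_before = build_node_selector(SPOT, {}, interruptible=False)

# Suggested setup: the spot label lives in interruptible-node-selector.
gpu_pod_after = build_node_selector({}, SPOT, interruptible=False)
spot_pod_after = build_node_selector({}, SPOT, interruptible=True)

print(gpu_pod_before)  # still carries the spot label
print(gpu_pod_after)   # no spot constraint left on the GPU pod
print(spot_pod_after)  # interruptible tasks are steered to spot nodes
```

In this model, CPU-focused tasks would opt in to spot nodes with `interruptible=True`, and GPU tasks would no longer inherit the spot constraint.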