We are facing issues with GPU the pod has no tolerations ass Flyte #flyte-support


elegant-toddler-67101

07/26/2023, 1:37 PM

We are facing issues with GPU: the pod has no tolerations assigned, although the task configuration sets a GPU resource. We use the flyte-core helm chart on GKE, and we have a node pool with taints. What are we missing?


@task(
    container_image="{{.image.indeed.fqn}}:{{.image.indeed.version}}",
    requests=Resources(cpu="1", mem="2Gi", gpu="1"),
    limits=Resources(cpu="1", mem="3Gi")
)

and this configuration in the flyte-core chart values:


k8s:
  plugins:
    k8s:
      gpu-resource-name: nvidia.com/gpu
      resource-tolerations:
        - nvidia.com/gpu:
          - key: "nvidia.com/gpu"
            operator: "Equal"
            value: "present"
            effect: "NoSchedule"
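
For reference, one way to sanity-check that a toleration like the one above would actually match the node pool taint is to replay the Kubernetes matching rules by hand. The following is only a minimal sketch of those rules, not Flyte or Kubernetes code: the `Taint` and `Toleration` classes and the `tolerates` helper are hypothetical names, and the taint `nvidia.com/gpu=present:NoSchedule` is assumed from the values in the configuration above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Taint:
    key: str
    value: str
    effect: str


@dataclass(frozen=True)
class Toleration:
    key: str = ""
    operator: str = "Equal"  # Kubernetes treats a missing operator as Equal
    value: str = ""
    effect: str = ""  # an empty effect matches every taint effect


def tolerates(tol: Toleration, taint: Taint) -> bool:
    """Return True if the toleration matches the taint (simplified rules)."""
    # A non-empty effect must be identical to the taint's effect.
    if tol.effect and tol.effect != taint.effect:
        return False
    # An empty key is only meaningful with Exists and then matches any key.
    if tol.key == "":
        return tol.operator == "Exists"
    if tol.key != taint.key:
        return False
    # Exists ignores the value; Equal requires the values to be identical.
    if tol.operator == "Exists":
        return True
    return tol.operator == "Equal" and tol.value == taint.value


# Assumed node pool taint, taken from the values in the chart config above.
gpu_taint = Taint("nvidia.com/gpu", "present", "NoSchedule")

# The toleration as written in the chart values.
configured = Toleration("nvidia.com/gpu", "Equal", "present", "NoSchedule")

print(tolerates(configured, gpu_taint))
```

If this check passes but the pod still carries no toleration, the mismatch is more likely in how the configuration reaches the plugin than in the toleration itself.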

freezing-boots-56761

07/26/2023, 1:53 PM

GKE should auto-inject the toleration, I believe

elegant-toddler-67101

07/26/2023, 1:54 PM

Yes, you’re right. But for some reason it doesn’t work, I don’t see the tolerations on the pod

freezing-boots-56761

07/26/2023, 1:54 PM

can you paste the REDACTED pod spec? what version of k8s are you running on GKE?

freezing-boots-56761

07/26/2023, 2:18 PM

you can also try running this pod manually to confirm:


apiVersion: v1
kind: Pod
metadata:
  name: nvidia-smi
  namespace: default
spec:
  restartPolicy: OnFailure
  containers:
  - name: nvidia-smi
    image: nvidia/cuda:12.2.0-runtime-ubuntu20.04
    args:
    - "nvidia-smi"
    resources:
      limits:
        nvidia.com/gpu: 1

elegant-toddler-67101

07/31/2023, 7:27 AM

It is working. The thing is, tolerations are not being attached to the pod, despite our Flyte configuration and task configuration (as mentioned above)

freezing-boots-56761

07/31/2023, 1:46 PM

were the tolerations attached automatically to the above pod though?
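
A quick way to answer that question is to dump the pod as JSON (for example with `kubectl get pod <name> -n <namespace> -o json`) and look at `spec.tolerations`. The snippet below is a rough sketch of that check, not part of Flyte: `gpu_summary` is a hypothetical helper, and the embedded pod JSON is a made-up sample that mirrors the nvidia-smi test pod rather than real cluster output.

```python
import json

GPU_RESOURCE = "nvidia.com/gpu"


def gpu_summary(pod: dict) -> dict:
    """List containers that ask for a GPU and tolerations keyed on the GPU resource."""
    spec = pod.get("spec") or {}
    gpu_containers = []
    for container in spec.get("containers") or []:
        resources = container.get("resources") or {}
        limits = resources.get("limits") or {}
        requests = resources.get("requests") or {}
        if GPU_RESOURCE in limits or GPU_RESOURCE in requests:
            gpu_containers.append(container["name"])
    gpu_tolerations = [
        t for t in spec.get("tolerations") or [] if t.get("key") == GPU_RESOURCE
    ]
    return {"gpu_containers": gpu_containers, "gpu_tolerations": gpu_tolerations}


# Hypothetical sample: a GPU pod that ended up without any tolerations.
sample_pod = json.loads("""
{
  "metadata": {"name": "nvidia-smi", "namespace": "default"},
  "spec": {
    "containers": [
      {
        "name": "nvidia-smi",
        "resources": {"limits": {"nvidia.com/gpu": "1"}}
      }
    ]
  }
}
""")

summary = gpu_summary(sample_pod)
print(summary)
```

A pod that requests a GPU but reports an empty `gpu_tolerations` list is the situation described in this thread.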

elegant-toddler-67101

08/07/2023, 2:46 PM

Sorry, it was an issue in our implementation. It works fine now. Thanks
