#ask-the-community

Ariel Kaspit

07/26/2023, 1:37 PM
We are facing an issue with GPUs: the pod has no tolerations assigned, although the task configuration requests a GPU resource. We use the flyte-core Helm chart on GKE, and our node pool has taints. What are we missing?
@task(
    container_image="{{.image.indeed.fqn}}:{{.image.indeed.version}}",
    requests=Resources(cpu="1", mem="2Gi", gpu="1"),
    limits=Resources(cpu="1", mem="3Gi")
)
and this configuration in the flyte-core chart values:
k8s:
  plugins:
    k8s:
      gpu-resource-name: nvidia.com/gpu
      resource-tolerations:
        - nvidia.com/gpu:
          - key: "nvidia.com/gpu"
            operator: "Equal"
            value: "present"
            effect: "NoSchedule"
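For reference, a minimal Python sketch of what a per-resource toleration setting is meant to do: look up tolerations by each resource name that the container requests or limits. This is an illustrative model only, not Flyte's actual code. It assumes the setting is a mapping from resource name to a list of tolerations, and the names `RESOURCE_TOLERATIONS` and `tolerations_for` are hypothetical.

```python
# Illustrative model only; not Flyte's implementation.
# Assumption: the setting maps a resource name to a list of tolerations.
RESOURCE_TOLERATIONS = {
    "nvidia.com/gpu": [
        {
            "key": "nvidia.com/gpu",
            "operator": "Equal",
            "value": "present",
            "effect": "NoSchedule",
        },
    ],
}


def tolerations_for(requests, limits, resource_tolerations=RESOURCE_TOLERATIONS):
    """Collect the tolerations for every resource named in requests or limits."""
    # A resource counts if it appears in either requests or limits.
    names = sorted(set(requests) | set(limits))
    found = []
    for name in names:
        found.extend(resource_tolerations.get(name, []))
    return found


# Resources shaped like the task above: GPU in requests only.
gpu_tolerations = tolerations_for(
    requests={"cpu": "1", "memory": "2Gi", "nvidia.com/gpu": "1"},
    limits={"cpu": "1", "memory": "3Gi"},
)
```

Under this model, a task that names `nvidia.com/gpu` in its requests or limits would pick up the configured toleration, while a CPU-only task would get none.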

jeev

07/26/2023, 1:53 PM
GKE should auto-inject the toleration, I believe.

Ariel Kaspit

07/26/2023, 1:54 PM
Yes, you’re right. But for some reason it doesn’t work; I don’t see the tolerations on the pod.

jeev

07/26/2023, 1:54 PM
Can you paste the REDACTED pod spec? What version of k8s are you running on GKE?
You can also try running this pod manually to confirm:
apiVersion: v1
kind: Pod
metadata:
  name: nvidia-smi
  namespace: default
spec:
  restartPolicy: OnFailure
  containers:
  - name: nvidia-smi
    image: nvidia/cuda:12.2.0-runtime-ubuntu20.04
    args:
    - "nvidia-smi"
    resources:
      limits:
        nvidia.com/gpu: 1

Ariel Kaspit

07/31/2023, 7:27 AM
It is working. The thing is, tolerations are not being attached to the pod, despite our Flyte configuration and task configuration (as mentioned above).

jeev

07/31/2023, 1:46 PM
Were the tolerations attached automatically to the above pod, though?
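One way to check this, sketched in Python under some assumptions: the pod manifest is available as a dict (for example, parsed from the output of `kubectl get pod nvidia-smi -o json`), and an exact key match is enough. The helper name `has_toleration` and the sample manifest are hypothetical, and wildcard tolerations (an empty key with `operator: Exists`) are not handled.

```python
def has_toleration(pod, key, effect="NoSchedule"):
    """Return True if the pod manifest has a toleration for `key` covering `effect`."""
    tolerations = pod.get("spec", {}).get("tolerations") or []
    for toleration in tolerations:
        key_matches = toleration.get("key") == key
        # A toleration with no effect set applies to every effect.
        effect_matches = toleration.get("effect") in (None, "", effect)
        if key_matches and effect_matches:
            return True
    return False


# Hypothetical manifest, shaped like a pod that received the GPU toleration.
example_pod = {
    "metadata": {"name": "nvidia-smi", "namespace": "default"},
    "spec": {
        "tolerations": [
            {
                "key": "nvidia.com/gpu",
                "operator": "Equal",
                "value": "present",
                "effect": "NoSchedule",
            },
        ],
    },
}

gpu_toleration_present = has_toleration(example_pod, "nvidia.com/gpu")
```

Running the same check against the manually created pod and against a Flyte-launched pod would show whether the toleration is missing only on the Flyte side.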

Ariel Kaspit

08/07/2023, 2:46 PM
Sorry, it was an issue in our implementation. It works fine now. Thanks.