Trying to use GPUs I added a tolerations section as describe Flyte #announcements

elegant-petabyte-32634

06/01/2022, 3:57 PM

Trying to use GPUs, I added a tolerations section as described here https://docs.flyte.org/projects/cookbook/en/stable/auto/deployment/configure_use_gpus.html (and in a previous comment where it was clarified where to apply this https://flyte-org.slack.com/archives/CNMKCU6FR/p1651591056890689?thread_ts=1651584781.772139&cid=CNMKCU6FR), i.e.:

# -- Kubernetes specific Flyte configuration
  k8s:
    plugins:
      # -- Configuration section for all K8s specific plugins [Configuration structure](https://pkg.go.dev/github.com/lyft/flyteplugins/go/tasks/pluginmachinery/flytek8s/config)
      k8s:
        default-env-vars: []
        #  DEFAULT_ENV_VAR: VALUE
        default-cpus: 100m
        default-memory: 100Mi

        resource-tolerations:
          - nvidia.com/gpu:
            - key: "key1"
              operator: "Equal"
              value: "value1"
              effect: "NoSchedule"

and I applied that with helm and also tried restarting the Flyte pods (kubectl rollout restart deploy), but the pods that get started by Flyte workflows don't get these tolerations (although they do get a default nvidia.com/gpu "exists" toleration regardless of my addition above). Anything I'm doing wrong?
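One way to make the check described above repeatable is to inspect the pod's JSON (for example the output of `kubectl get pod <name> -o json`) and test for a specific toleration. The following is only a minimal sketch: `has_toleration` and `sample_pod` are hypothetical names, and the sample pod is invented to mirror the situation reported here (a default `nvidia.com/gpu` "Exists" toleration present, the custom `key1` toleration missing).

```python
def has_toleration(pod, key, value=None, effect=None):
    """Return True if the pod spec carries a toleration matching the given fields.

    `value` and `effect` are only compared when they are provided.
    """
    tolerations = pod.get("spec", {}).get("tolerations") or []
    for tol in tolerations:
        if tol.get("key") != key:
            continue
        if value is not None and tol.get("value") != value:
            continue
        if effect is not None and tol.get("effect") != effect:
            continue
        return True
    return False


# Hypothetical pod, shaped like `kubectl get pod <name> -o json` output.
sample_pod = {
    "spec": {
        "tolerations": [
            {"key": "nvidia.com/gpu", "operator": "Exists"},
        ]
    }
}

# The default GPU toleration is found, the custom one from the values file is not.
default_present = has_toleration(sample_pod, "nvidia.com/gpu")
custom_present = has_toleration(sample_pod, "key1", value="value1", effect="NoSchedule")
print(default_present, custom_present)
```

Against a real pod, the dictionary would come from `json.loads` applied to the kubectl output instead of the hard-coded sample.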

elegant-petabyte-32634

06/01/2022, 4:01 PM

also tried putting resource-tolerations one level higher so it's under plugins, but that's not working either

acceptable-policeman-57188

06/01/2022, 5:07 PM

hey @elegant-petabyte-32634 just to double check, the tasks with non-zero gpu resource requests are also not getting the tolerations?

elegant-petabyte-32634

06/01/2022, 5:38 PM

Hey! Yes, I tried it with a task that's requesting 1 GPU

elegant-petabyte-32634

06/01/2022, 5:40 PM

from flytekit import task, workflow, Resources


@task(
    requests=Resources(gpu="1", cpu="2"),
    limits=Resources(mem="8Gi"),
)
def test_gpu():
    ...


@workflow
def wf():
    test_gpu()

# pyflyte run --remote -p flytesnacks -d development testgpu.py wf

acceptable-policeman-57188

06/01/2022, 5:45 PM

just to double check, are you overwriting the value of gpu-resource-name in your config?

elegant-petabyte-32634

06/01/2022, 5:47 PM

no i'm not

acceptable-policeman-57188

06/01/2022, 5:51 PM

do get a default nvidia.com/gpu "exists" toleration regardless of my addition above).

is this only for the test_gpu task pod or all pods?

elegant-petabyte-32634

06/01/2022, 5:53 PM

I'll check, sec

elegant-petabyte-32634

06/01/2022, 5:54 PM

the non-gpu ones don't get it no

acceptable-policeman-57188

06/01/2022, 5:59 PM

can we double check real quick that your config is being parsed? do you mind port-forwarding propeller

kubectl -n flyte port-forward deploy/flytepropeller 10254

and going to http://localhost:10254/config
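Rather than reading the config dump by eye, the key can be searched for programmatically. This is a rough sketch under the assumption that the endpoint returns JSON that has been parsed into nested dicts and lists. `find_key` and `sample_config` are hypothetical, and the sample layout is made up for illustration rather than copied from a real propeller dump.

```python
def find_key(node, target, path=()):
    """Yield (path, value) for every occurrence of `target` in nested dicts/lists."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == target:
                yield path + (key,), value
            # Keep descending so nested occurrences are reported as well.
            yield from find_key(value, target, path + (key,))
    elif isinstance(node, list):
        for index, item in enumerate(node):
            yield from find_key(item, target, path + (index,))


# Hypothetical, heavily trimmed config dump; the real structure may differ.
sample_config = {
    "plugins": {
        "k8s": {
            "default-cpus": "100m",
            "resource-tolerations": None,
        }
    }
}

matches = list(find_key(sample_config, "resource-tolerations"))
for found_path, found_value in matches:
    print(" > ".join(str(part) for part in found_path), "=", found_value)
```

With the port-forward running, the dictionary could be loaded with `json.load(urllib.request.urlopen("http://localhost:10254/config"))`, again assuming the response is JSON. A value of `None` (JSON `null`) would indicate that the setting was not picked up.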

elegant-petabyte-32634

06/01/2022, 6:00 PM

sure

elegant-petabyte-32634

06/01/2022, 6:03 PM

resource-tolerations is just null there, are we sure the previous comment was right in adding it to that plugins section and not to flytepropeller's for example?

acceptable-policeman-57188

06/01/2022, 6:05 PM

cool so something is not being set correctly in the yaml, however it should be in the plugins section, that does look correct

acceptable-policeman-57188

06/01/2022, 6:06 PM

I'm not sure why you have the top-most k8s block though

elegant-petabyte-32634

06/01/2022, 6:06 PM

it's there in the official values.yaml too https://github.com/flyteorg/flyte/blob/master/charts/flyte-core/values.yaml#L639

acceptable-policeman-57188

06/01/2022, 6:07 PM

https://github.com/flyteorg/flyte/blob/master/deployment/eks/flyte_generated.yaml#L8302