Trying to use GPUs, I added a tolerations section ...
# announcements
r
Trying to use GPUs, I added a tolerations section as described here https://docs.flyte.org/projects/cookbook/en/stable/auto/deployment/configure_use_gpus.html (and in a previous comment where it was clarified where to apply this https://flyte-org.slack.com/archives/CNMKCU6FR/p1651591056890689?thread_ts=1651584781.772139&cid=CNMKCU6FR), i.e.
# -- Kubernetes specific Flyte configuration
  k8s:
    plugins:
      # -- Configuration section for all K8s specific plugins [Configuration structure](https://pkg.go.dev/github.com/lyft/flyteplugins/go/tasks/pluginmachinery/flytek8s/config)
      k8s:
        default-env-vars: []
        #  DEFAULT_ENV_VAR: VALUE
        default-cpus: 100m
        default-memory: 100Mi

        resource-tolerations:
          - nvidia.com/gpu:
            - key: "key1"
              operator: "Equal"
              value: "value1"
              effect: "NoSchedule"
and I applied that with helm and also tried restarting the Flyte pods (kubectl rollout restart deploy), but the pods that get started by Flyte workflows don't get these tolerations (although they do get a default nvidia.com/gpu "exists" toleration regardless of my addition above). Anything I'm doing wrong?
also tried putting resource-tolerations one level higher so it's directly under plugins, but that's not working either
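One way to confirm what a task pod actually received is to dump it with `kubectl get pod <pod-name> -o json` and compare `spec.tolerations` against the toleration from the values file. The following is only a minimal sketch: the helper name `has_toleration` and the sample pod are hypothetical, and the only thing it relies on is the standard Kubernetes pod layout (`spec.tolerations` as a list of objects with `key`, `operator`, `value`, `effect`).

```python
import json


def has_toleration(pod, expected):
    """Return True if any toleration on the pod matches every field in `expected`.

    `pod` is the parsed output of `kubectl get pod <pod-name> -o json`.
    `expected` is a dict such as {"key": "key1", "operator": "Equal", ...}.
    """
    tolerations = pod.get("spec", {}).get("tolerations") or []
    return any(
        all(toleration.get(field) == value for field, value in expected.items())
        for toleration in tolerations
    )


# Hypothetical pod that only carries the default "Exists" GPU toleration.
sample_pod = json.loads("""
{
  "spec": {
    "tolerations": [
      {"key": "nvidia.com/gpu", "operator": "Exists", "effect": "NoSchedule"}
    ]
  }
}
""")

# The toleration configured in the values file above.
configured = {
    "key": "key1",
    "operator": "Equal",
    "value": "value1",
    "effect": "NoSchedule",
}

missing = not has_toleration(sample_pod, configured)
```

With the sample pod above, `missing` is true: the default toleration is present, but the configured one never reached the pod.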
k
hey @Robin Kahlow just to double check, the tasks with non-zero gpu resource requests are also not getting the tolerations?
r
Hey! Yes, I tried it with a task that's requesting 1 GPU
from flytekit import task, workflow, Resources


@task(
    requests=Resources(gpu="1", cpu="2"),
    limits=Resources(mem="8Gi"),
)
def test_gpu():
    ...


@workflow
def wf():
    test_gpu()

# pyflyte run --remote -p flytesnacks -d development testgpu.py wf
k
just to double check, are you overwriting the value of gpu-resource-name in your config?
r
no, I'm not
k
> do get a default nvidia.com/gpu "exists" toleration regardless of my addition above
is this only for the test_gpu task pod or all pods?
r
I'll check, sec
the non-GPU ones don't get it, no
k
can we double check real quick that your config is being parsed? do you mind port-forwarding propeller
kubectl -n flyte port-forward deploy/flytepropeller 10254
and going to http://localhost:10254/config
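The same check can be scripted instead of read by eye. This is a rough sketch that assumes the port-forward above is running and that the `/config` endpoint returns JSON; the helper names (`find_key`, `fetch_config`, `unset_settings`) are made up for illustration. It walks the parsed config and reports every place where `resource-tolerations` appears but is null or empty.

```python
import json
from urllib.request import urlopen


def find_key(node, target, path=()):
    """Yield (path, value) for every occurrence of `target` in nested dicts/lists."""
    if isinstance(node, dict):
        for key, value in node.items():
            here = path + (key,)
            if key == target:
                yield here, value
            yield from find_key(value, target, here)
    elif isinstance(node, list):
        for index, item in enumerate(node):
            yield from find_key(item, target, path + (index,))


def fetch_config(url="http://localhost:10254/config"):
    # Assumption: the endpoint serves the parsed configuration as JSON.
    with urlopen(url) as response:
        return json.load(response)


def unset_settings(config, key="resource-tolerations"):
    """Return the paths where `key` is present but null or empty."""
    return [path for path, value in find_key(config, key) if not value]
```

Calling `unset_settings(fetch_config())` returns a non-empty list when the setting was not picked up, which points at the values file or Helm release rather than at the plugin itself.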
r
sure
resource-tolerations is just null there, are we sure the previous comment was right in adding it to that plugins section and not to flytepropeller's for example?
k
cool, so something is not being set correctly in the YAML. It should be in the plugins section though, and that part does look correct
I'm not sure why you have the top-most k8s block though
r
i think that top level key in the values.yaml is used to name the yaml file in the generated configmap
r
ah
oh whoops, I was editing the wrong values file... sorry for wasting your time
k
no problem!