#announcements

Robin Kahlow

06/01/2022, 3:57 PM
Trying to use GPUs, I added a tolerations section as described here https://docs.flyte.org/projects/cookbook/en/stable/auto/deployment/configure_use_gpus.html (and in a previous comment that clarified where to apply it: https://flyte-org.slack.com/archives/CNMKCU6FR/p1651591056890689?thread_ts=1651584781.772139&cid=CNMKCU6FR), i.e.:
# -- Kubernetes specific Flyte configuration
  k8s:
    plugins:
      # -- Configuration section for all K8s specific plugins [Configuration structure](https://pkg.go.dev/github.com/lyft/flyteplugins/go/tasks/pluginmachinery/flytek8s/config)
      k8s:
        default-env-vars: []
        #  DEFAULT_ENV_VAR: VALUE
        default-cpus: 100m
        default-memory: 100Mi

        resource-tolerations:
          - nvidia.com/gpu:
            - key: "key1"
              operator: "Equal"
              value: "value1"
              effect: "NoSchedule"
and I applied that with helm and also tried restarting the Flyte pods (kubectl rollout restart deploy), but the pods that get started by Flyte workflows don't get these tolerations (although they do get a default nvidia.com/gpu "exists" toleration regardless of my addition above). Anything I'm doing wrong?
I also tried putting resource-tolerations one level higher, so it's directly under plugins, but that didn't work either.
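One way to confirm which tolerations a task pod actually received is to dump the pod as JSON and inspect `spec.tolerations`. The following is a minimal sketch, not part of the original thread: the helper names `pod_tolerations` and `has_toleration` and the sample manifest are hypothetical, and the sample only mirrors the default "Exists" toleration described above.

```python
import json


def pod_tolerations(pod: dict) -> list:
    """Return the tolerations from a pod manifest, or an empty list if none are set."""
    return pod.get("spec", {}).get("tolerations") or []


def has_toleration(pod: dict, key: str, operator: str = "Equal",
                   value=None, effect=None) -> bool:
    """Check whether the pod carries a toleration matching the given fields.

    `value` and `effect` are only compared when they are provided.
    Kubernetes treats a missing operator as "Equal", so the same default is used here.
    """
    for toleration in pod_tolerations(pod):
        if toleration.get("key") != key:
            continue
        if toleration.get("operator", "Equal") != operator:
            continue
        if value is not None and toleration.get("value") != value:
            continue
        if effect is not None and toleration.get("effect") != effect:
            continue
        return True
    return False


# Illustrative, heavily trimmed pod manifest (an assumption, not real cluster output).
sample_pod = json.loads("""
{
  "spec": {
    "tolerations": [
      {"key": "nvidia.com/gpu", "operator": "Exists"}
    ]
  }
}
""")

print(has_toleration(sample_pod, "nvidia.com/gpu", operator="Exists"))
print(has_toleration(sample_pod, "key1", value="value1", effect="NoSchedule"))
```

To run this against a real pod, the manifest could be produced with `kubectl get pod <pod-name> -n <namespace> -o json` and loaded with `json.load`; the pod name and namespace are placeholders.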

katrina

06/01/2022, 5:07 PM
hey @Robin Kahlow just to double check, the tasks with non-zero gpu resource requests are also not getting the tolerations?

Robin Kahlow

06/01/2022, 5:38 PM
Hey! Yes, I tried it with a task that's requesting 1 GPU:
from flytekit import task, workflow, Resources


# Request one GPU so the task pod should pick up the GPU tolerations.
@task(
    requests=Resources(gpu="1", cpu="2"),
    limits=Resources(mem="8Gi"),
)
def test_gpu():
    ...


@workflow
def wf():
    test_gpu()

# Run remotely with:
# pyflyte run --remote -p flytesnacks -d development testgpu.py wf

katrina

06/01/2022, 5:45 PM
just to double check, are you overwriting the value of gpu-resource-name in your config?

Robin Kahlow

06/01/2022, 5:47 PM
No, I'm not.

katrina

06/01/2022, 5:51 PM
> they do get a default nvidia.com/gpu "exists" toleration regardless of my addition above
Is this only for the test_gpu task pod or for all pods?

Robin Kahlow

06/01/2022, 5:53 PM
I'll check, one sec.
No, the non-GPU ones don't get it.

katrina

06/01/2022, 5:59 PM
can we double check real quick that your config is being parsed? do you mind port-forwarding propeller
kubectl -n flyte port-forward deploy/flytepropeller 10254
and going to http://localhost:10254/config
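The same check can be scripted once the port-forward is running. This is a minimal sketch, assuming the `/config` endpoint returns JSON; the exact response layout is not shown in this thread, so the hypothetical helper `find_key` searches the whole document for the key rather than relying on a fixed path, and the sample dictionary is illustrative only.

```python
import json
from urllib.request import urlopen


def find_key(obj, key):
    """Yield every value stored under `key` anywhere in nested dicts and lists."""
    if isinstance(obj, dict):
        for k, v in obj.items():
            if k == key:
                yield v
            yield from find_key(v, key)
    elif isinstance(obj, list):
        for item in obj:
            yield from find_key(item, key)


def fetch_config(url="http://localhost:10254/config"):
    """Fetch and parse the config dump (assumes a JSON response; needs the port-forward)."""
    with urlopen(url, timeout=5) as resp:
        return json.load(resp)


# Illustrative stand-in for a parsed config dump (an assumption, not real output).
sample = {"plugins": {"k8s": {"default-cpus": "100m", "resource-tolerations": None}}}
print(list(find_key(sample, "resource-tolerations")))
```

With the port-forward active, `list(find_key(fetch_config(), "resource-tolerations"))` would show whether the setting reached propeller; a result containing only `None` suggests the value was never loaded.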

Robin Kahlow

06/01/2022, 6:00 PM
sure
resource-tolerations is just null there. Are we sure the previous comment was right to add it to that plugins section and not, for example, to flytepropeller's?

katrina

06/01/2022, 6:05 PM
Cool, so something is not being set correctly in the YAML. It should be in the plugins section, though, so that part does look correct.
I'm not sure why you have the top-most k8s block though

Robin Kahlow

06/01/2022, 6:06 PM
I think that top-level key in the values.yaml is used to name the YAML file in the generated configmap.

Robin Kahlow

06/01/2022, 6:07 PM
ah
oh whoops, I was editing the wrong values file... sorry for wasting your time

katrina

06/01/2022, 6:18 PM
no problem!