Thread
#announcements
    Robin Kahlow

    3 months ago
    Trying to use GPUs, I added a tolerations section as described here https://docs.flyte.org/projects/cookbook/en/stable/auto/deployment/configure_use_gpus.html (and in a previous comment where it was clarified where to apply this https://flyte-org.slack.com/archives/CNMKCU6FR/p1651591056890689?thread_ts=1651584781.772139&cid=CNMKCU6FR), i.e.:
    # -- Kubernetes specific Flyte configuration
      k8s:
        plugins:
          # -- Configuration section for all K8s specific plugins [Configuration structure](https://pkg.go.dev/github.com/lyft/flyteplugins/go/tasks/pluginmachinery/flytek8s/config)
          k8s:
            default-env-vars: []
            #  DEFAULT_ENV_VAR: VALUE
            default-cpus: 100m
            default-memory: 100Mi
    
            resource-tolerations:
              - nvidia.com/gpu:
                - key: "key1"
                  operator: "Equal"
                  value: "value1"
                  effect: "NoSchedule"
    and I applied that with helm and also tried restarting the Flyte pods (kubectl rollout restart deploy), but the pods that get started by Flyte workflows don't get these tolerations (although they do get a default nvidia.com/gpu "exists" toleration regardless of my addition above). Anything I'm doing wrong?
    I also tried putting resource-tolerations one level higher so it's directly under plugins, but that's not working either
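One way to confirm what a task pod actually received is to dump it with `kubectl get pod <pod> -o json` and compare `spec.tolerations` against the expected entry. The following is a minimal sketch, assuming the pod JSON has already been loaded into a dict. The helper name `has_toleration` and the sample pod are hypothetical, and the expected values are the placeholders from the config above.

```python
def has_toleration(pod: dict, expected: dict) -> bool:
    """Return True if any toleration in the pod spec matches every field in `expected`."""
    tolerations = pod.get("spec", {}).get("tolerations") or []
    return any(
        all(tol.get(field) == value for field, value in expected.items())
        for tol in tolerations
    )


# Placeholder toleration copied from the config snippet above.
EXPECTED = {"key": "key1", "operator": "Equal", "value": "value1", "effect": "NoSchedule"}

# Hypothetical pod that only carries a default GPU "Exists" toleration.
sample_pod = {
    "spec": {
        "tolerations": [
            {"key": "nvidia.com/gpu", "operator": "Exists"},
        ]
    }
}

print(has_toleration(sample_pod, EXPECTED))  # False
```

A `False` result for a GPU task pod would match the symptom described here: the default toleration is present, but the custom one is not.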
    katrina

    3 months ago
    hey @Robin Kahlow just to double check, the tasks with non-zero gpu resource requests are also not getting the tolerations?
    Robin Kahlow

    3 months ago
    Hey! Yes, I tried it with a task that's requesting 1 GPU
    from flytekit import task, workflow, Resources
    
    
    @task(
        requests=Resources(gpu="1", cpu="2"),
        limits=Resources(mem="8Gi"),
    )
    def test_gpu():
        ...
    
    
    @workflow
    def wf():
        test_gpu()
    
    # pyflyte run --remote -p flytesnacks -d development testgpu.py wf
    katrina

    3 months ago
    just to double check, are you overwriting the value of gpu-resource-name in your config?
    Robin Kahlow

    3 months ago
    no i'm not
    katrina

    3 months ago
    > (although they do get a default nvidia.com/gpu "exists" toleration regardless of my addition above)
    is this only for the test_gpu task pod or all pods?
    Robin Kahlow

    3 months ago
    I'll check, sec
    the non-GPU ones don't get it, no
    katrina

    3 months ago
    can we double check real quick that your config is being parsed? do you mind port-forwarding propeller
    kubectl -n flyte port-forward deploy/flytepropeller 10254
    and going to http://localhost:10254/config
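The same check can be scripted instead of read in the browser. This is a rough sketch, assuming the `/config` endpoint returns JSON and that the port-forward above is running. The helper names `fetch_config` and `find_key` are hypothetical, and the search is recursive because the exact nesting of the dump is not shown in this thread.

```python
import json
from urllib.request import urlopen


def fetch_config(url: str = "http://localhost:10254/config") -> dict:
    """Fetch the config dump from the port-forwarded propeller (assumed to be JSON)."""
    with urlopen(url, timeout=5) as resp:
        return json.load(resp)


def find_key(node, key: str) -> list:
    """Recursively collect every value stored under `key` in nested dicts/lists."""
    found = []
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                found.append(v)
            found.extend(find_key(v, key))
    elif isinstance(node, list):
        for item in node:
            found.extend(find_key(item, key))
    return found
```

Calling `find_key(fetch_config(), "resource-tolerations")` returns an empty list if the key is absent, and `[None]` if it is present but null, which would suggest the setting never reached the running config.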
    Robin Kahlow

    3 months ago
    sure
    resource-tolerations is just null there. Are we sure the previous comment was right in adding it to that plugins section and not to flytepropeller's, for example?
    katrina

    3 months ago
    cool so something is not being set correctly in the yaml, however it should be in the plugins section, that does look correct
    I'm not sure why you have the top-most k8s block though
    Robin Kahlow

    3 months ago
    I think that top-level key in the values.yaml is used to name the YAML file in the generated configmap
    Robin Kahlow

    3 months ago
    ah
    oh whoops, I was editing the wrong values file... sorry for wasting your time
    katrina

    3 months ago
    no problem!
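Since the root cause was an edit to the wrong values file, a quick guard is to ask Helm which values the deployed release actually carries. Below is a minimal sketch, assuming `helm` is on the PATH. The release name passed to `deployed_values` is up to the caller, and the key path in `TOLERATIONS_PATH` is an assumption about the chart layout, not something confirmed in this thread.

```python
import json
import subprocess


def deployed_values(release: str, namespace: str = "flyte") -> dict:
    """Return the user-supplied values of a deployed Helm release as a dict."""
    out = subprocess.run(
        ["helm", "get", "values", release, "-n", namespace, "-o", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    # Helm prints "null" when no user values were supplied.
    return json.loads(out) or {}


def get_path(values: dict, path: list):
    """Walk nested dicts along `path`; return None if any step is missing."""
    node = values
    for part in path:
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node


# Assumed location of the setting in the chart values (not confirmed in the thread).
TOLERATIONS_PATH = ["configmap", "k8s", "plugins", "k8s", "resource-tolerations"]
```

If `get_path(deployed_values("<release>"), TOLERATIONS_PATH)` comes back as `None`, the edited file is probably not the one the release was installed or upgraded with.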