# flyte-support
f
Hello everyone, we are currently trying to leverage GPU nodes on Azure, and we have not managed to find a way to apply tolerations to the task pods. We have added some configuration based on this documentation: https://docs.flyte.org/en/latest/user_guide/productionizing/configuring_access_to_gpus.html. However, tolerations are still not being applied to the pods. This is our current configuration:
configuration:
    inline:
      task_resources:
        defaults:
          cpu: 500m
          memory: 1Gi
          nvidia.com/gpu: "1"
        limits:
          cpu: 2
          memory: 2Gi
          nvidia.com/gpu: "1"
      plugins:
        k8s:
          inject-finalizer: true
          default-memory: 200Gi
          default-cpus: "20"
          resource-tolerations:
            - gpu:
              - key: "gpu"
                operator: "Equal"
                value: "true"
                effect: "NoSchedule"
          gpu-resource-name: "nvidia.com/gpu"
          default-node-selector:
            poolname: gpu
We can see it's present in the main Flyte pod config file, but the only thing being applied to the task pod is the nodeSelector:
nodeSelector:
    poolname: gpu
I really appreciate any help, thank you!
a
Hey Jakub! In this PR (https://github.com/unionai-oss/deploy-flyte/pull/23) we're adding support for GPU consumption to a set of Terraform files that serve as a Flyte reference implementation on Azure; maybe something there is helpful for you. BTW, I'm working on getting the GPU documentation for Flyte updated, but in the meantime...
Whatever you define under `resource-tolerations` should match an ExtendedResource advertised by the device driver plugin on your K8s nodes. For NVIDIA accelerators, it's typically `nvidia.com/gpu` instead of just `gpu`.
Honestly, I'm not quite sure what the effect of `gpu-resource-name` is; I haven't needed it and have been able to use GPUs on tainted nodes in Azure.
I haven't tried it, but maybe it's helpful when the Extended Resource name is different. In any case, when you request a GPU in the task decorator or as part of your resource defaults, the advertised and configured device names have to match; otherwise I don't see how it would apply tolerations. In summary, try switching `gpu` to `nvidia.com/gpu` under `resource-tolerations`.
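For reference, a minimal sketch of what that change might look like. It keeps the structure and the taint (`gpu=true:NoSchedule`) from the original post and only renames the key under `resource-tolerations`. That the GPU node pool actually carries this taint is an assumption, and the exact nesting should be checked against your Flyte version.

```yaml
configuration:
  inline:
    plugins:
      k8s:
        resource-tolerations:
          # Key changed from "gpu" to the extended resource name
          # advertised by the NVIDIA device plugin on the nodes.
          - nvidia.com/gpu:
            # Taint copied from the original post; assumed to match
            # the taint actually set on the GPU node pool.
            - key: "gpu"
              operator: "Equal"
              value: "true"
              effect: "NoSchedule"
```

After redeploying, inspecting the task pod spec (for example with `kubectl get pod <pod-name> -o yaml`) should show whether the toleration now appears alongside the nodeSelector.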
Also bear in mind that even if you don't set this list, flytepropeller should inject a `nvidia.com/gpu:NoSchedule` toleration when you request a GPU device. The list is useful if your GPU nodes have additional taints.