Hello everyone we are currently trying to leverage GPU nodes Flyte #flyte-support

fancy-hamburger-89099

06/18/2024, 7:47 AM

Hello everyone, we are currently trying to leverage GPU nodes on Azure, and we have not managed to find a way to apply tolerations to the task pods. We have added some configuration based on this documentation: https://docs.flyte.org/en/latest/user_guide/productionizing/configuring_access_to_gpus.html However, tolerations are still not being applied to the pods. This is our current configuration:

configuration:
    inline:
      task_resources:
        defaults:
          cpu: 500m
          memory: 1Gi
          nvidia.com/gpu: "1"
        limits:
          cpu: 2
          memory: 2Gi
          nvidia.com/gpu: "1"
      plugins:
        k8s:
          inject-finalizer: true
          default-memory: 200Gi
          default-cpus: "20"
          resource-tolerations:
            - gpu:
              - key: "gpu"
                operator: "Equal"
                value: "true"
                effect: "NoSchedule"
          gpu-resource-name: "nvidia.com/gpu"
          default-node-selector:
            poolname: gpu

We can see this configuration in the main Flyte pod's config file, but the only thing being applied to the task pod is the nodeSelector:

nodeSelector:
    poolname: gpu

I really appreciate any help, thank you!

average-finland-92144

06/18/2024, 3:16 PM

Hey Jakub! In this PR (https://github.com/unionai-oss/deploy-flyte/pull/23) we're adding support for GPU consumption to a set of Terraform files that serve as a Flyte reference implementation on Azure, maybe something there is helpful for you. BTW I'm working on getting the GPU documentation for Flyte updated but in the meantime...

average-finland-92144

06/18/2024, 3:18 PM

Whatever you define under `resource-tolerations` should match an ExtendedResource advertised by the device driver plugin on your K8s nodes. For NVIDIA accelerators, it's typically `nvidia.com/gpu` instead of just `gpu`.

average-finland-92144

06/18/2024, 3:22 PM

Honestly I'm not quite sure what the effect of `gpu-resource-name` is. I haven't needed it and have been able to use GPUs on tainted nodes in Azure.

average-finland-92144

06/18/2024, 3:37 PM

I haven't tried it but maybe it's helpful when the Extended Resource name is different. In any case, when you request a GPU in the task decorator or as part of your resource defaults, the advertised and configured device names have to match; otherwise I don't see how it would apply tolerations. In summary, try switching `gpu` to `nvidia.com/gpu` under `resource-tolerations`.

Also bear in mind that even if you don't set this list, flytepropeller should inject a `nvidia.com/gpu:NoSchedule` toleration when you request a GPU device. The list is useful if your GPU nodes have additional taints.
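
For reference, the change suggested in this thread might look roughly like the fragment below. This is only a sketch, not a verified configuration: it reuses the taint values from the original question (`gpu=true:NoSchedule`) and assumes that `resource-tolerations` is a map keyed by the extended resource name rather than a list, so the leading `- ` before the resource key is dropped. Other keys from the original config are omitted for brevity.

```yaml
# Sketch only, based on the config posted in the question.
plugins:
  k8s:
    resource-tolerations:
      # Assumption: the key is the extended resource name advertised
      # by the device plugin on the GPU nodes, not a free-form label.
      nvidia.com/gpu:
        # Taint values copied from the question; adjust to the taints
        # actually set on your GPU node pool.
        - key: "gpu"
          operator: "Equal"
          value: "true"
          effect: "NoSchedule"
```

With a fragment like this in place, the tolerations would only be expected on task pods that actually request `nvidia.com/gpu`, either in the task decorator or through the resource defaults.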
