Hi I d like to specify the amount and quantity of GPUs when Flyte #ml-and-mlops-questions


silly-book-73230

01/13/2025, 12:04 PM

Hi, I'd like to specify the type and number of GPUs when running tasks in the Google Kubernetes Engine. Currently, I do that with a task specification like this:


@task(
    requests=Resources(cpu="8", mem="54Gi", gpu="2"),
    limits=Resources(cpu="100", mem="1Ti"),
    pod_template=PodTemplate(
        pod_spec=V1PodSpec(
            containers=[
                V1Container(
                    name="primary",
                ),
            ],
            node_selector={
                "cloud.google.com/gke-accelerator": "nvidia-l4",
                "cloud.google.com/gke-accelerator-count": "2",
            },
        )
    ),
)

I see that Flyte also has a feature for selecting GPUs: https://docs.flyte.org/en/latest/api/flytekit/extras.accelerators.html However, if I remove the pod_template and just add the accelerator kwarg, then flytepropeller gives the following error:


E0113 12:02:55.686281       1 workers.go:103] error syncing '-': failed at Node[-]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [container]: [GKE Warden constraints violations[] failed to create resource, caused by: admission webhook "warden-validating.common-webhooks.networking.gke.io" denied the request: GKE Warden rejected the request because it violates one or more constraints.
Violations details: {"[denied by autogke-gpu-limitation]":["When requesting 'nvidia.com/gpu' resources, you must specify either node selector 'cloud.google.com/gke-accelerator' with accelerator type or node selector 'cloud.google.com/compute-class' with existing custom compute class which has at least one GPU priority rule."]}

This suggests that the right GKE config is not properly set by providing the accelerator kwarg. Is this supposed to happen? If not, what is the point of the accelerator kwarg?
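For reference, the rule quoted in that error can be restated as a small check. This is only a sketch of the rule as the message words it, not GKE Warden's actual implementation: the function names and the simplified pod-spec dictionary are made up for illustration, and it does not verify that a named compute class really has a GPU priority rule.

```python
GPU_RESOURCE = "nvidia.com/gpu"
ACCELERATOR_LABEL = "cloud.google.com/gke-accelerator"
COMPUTE_CLASS_LABEL = "cloud.google.com/compute-class"


def requests_gpu(pod_spec: dict) -> bool:
    """Return True if any container asks for nvidia.com/gpu in requests or limits."""
    for container in pod_spec.get("containers", []):
        resources = container.get("resources", {})
        for section in ("requests", "limits"):
            if GPU_RESOURCE in resources.get(section, {}):
                return True
    return False


def violates_gpu_rule(pod_spec: dict) -> bool:
    """Hypothetical restatement of the 'autogke-gpu-limitation' message."""
    if not requests_gpu(pod_spec):
        return False  # the rule only applies to pods that ask for GPUs
    selector = pod_spec.get("nodeSelector", {})
    return ACCELERATOR_LABEL not in selector and COMPUTE_CLASS_LABEL not in selector


# A GPU pod without either node selector: this is the case that was denied.
rejected = {
    "containers": [{"name": "primary", "resources": {"limits": {GPU_RESOURCE: "2"}}}],
}

# The same pod with the accelerator-type selector set.
accepted = {
    "containers": [{"name": "primary", "resources": {"limits": {GPU_RESOURCE: "2"}}}],
    "nodeSelector": {ACCELERATOR_LABEL: "nvidia-l4"},
}

print(violates_gpu_rule(rejected))  # True
print(violates_gpu_rule(accepted))  # False
```

So the error means the pod reached the cluster asking for GPUs but without an accelerator-type node selector attached.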

gentle-tomato-480

01/13/2025, 12:43 PM

Hey Pim, lemme share how I use it

gentle-tomato-480

01/13/2025, 12:44 PM

Here's a task that uses a L4 GPU on GKE and requests some other resources.

@task(
    container_image=image_spec,
    requests=Resources(cpu="1", mem="2G"),
    accelerator=GPUAccelerator("nvidia-l4"),
    limits=Resources(cpu="4", mem="7G", gpu="1"),
    timeout=timedelta(minutes=10),
)

gentle-tomato-480

01/13/2025, 12:48 PM

Also important to know is that your flyte install should have these in the config: https://github.com/flyteorg/flyte/blob/b8fb68df84675f25befea766a19f392fb06ae7e6/charts/flyte-binary/gke-starter.yaml#L79-L90

gentle-tomato-480

01/13/2025, 12:55 PM

Also, as you can see, the `gpu` request should be set in the `limits` instead of the `requests`: https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/#using-device-plugins
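The linked Kubernetes page explains why the limit is the place for the GPU count: GPU limits may be given without requests (the limit is then used as the request), requests and limits must be equal if both are given, and a GPU request without a limit is not allowed. A minimal sketch of those three rules follows; the helper name `effective_gpu_request` and the plain-dict inputs are assumptions for illustration, not a Kubernetes or Flyte API.

```python
from typing import Optional

GPU = "nvidia.com/gpu"


def effective_gpu_request(requests: dict, limits: dict) -> Optional[str]:
    """Return the GPU count that would be used, or raise if the combination is invalid."""
    req = requests.get(GPU)
    lim = limits.get(GPU)
    if lim is None:
        if req is not None:
            # A GPU request on its own is not accepted.
            raise ValueError("GPU requests without GPU limits are not allowed")
        return None  # no GPU asked for at all
    if req is not None and req != lim:
        raise ValueError("GPU requests and limits must be equal")
    # With only a limit set, the limit doubles as the request.
    return lim


print(effective_gpu_request({}, {GPU: "1"}))          # 1
print(effective_gpu_request({GPU: "2"}, {GPU: "2"}))  # 2
```

Putting the count only in the limits, as in the task above, is the simplest of the valid combinations.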

silly-book-73230

01/13/2025, 1:54 PM

ah, thanks, that's helpful!

gentle-tomato-480

01/13/2025, 1:56 PM

Works at least with `flytekit` version 1.13.7 and `flyte-binary` chart version 1.13.2

freezing-airport-6809

01/13/2025, 3:05 PM

Good thing to add to docs

silly-book-73230

01/13/2025, 3:39 PM

Does this then automatically set "cloud.google.com/gke-accelerator-count" ? Shouldn't we also specify that label to Flyte as well?

silly-book-73230

01/13/2025, 3:40 PM

Or is this option not necessary?

gentle-tomato-480

01/13/2025, 3:40 PM

No, it won't. But I'm not sure it's necessary unless you use the count to select certain node pools

silly-book-73230

01/13/2025, 3:40 PM

Ah okay ty

gentle-tomato-480

01/13/2025, 3:41 PM

Otherwise k8s, using the `gpu` limit, will be smart enough to find you a GPU that fulfills your resource request

gentle-tomato-480

01/13/2025, 3:42 PM

So if you have only 1 node pool that has machines with 8 CPU, 54GB memory and 2 L4s, it's almost guaranteed to be scheduled there even without the count label
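A toy model of that reasoning, for anyone wondering when the count label starts to matter. This is a rough sketch, not how the Kubernetes scheduler is implemented: the pool names and sizes are hypothetical, and real nodes expose less allocatable CPU and memory than their nominal machine size.

```python
from dataclasses import dataclass, field

ACCELERATOR = "cloud.google.com/gke-accelerator"
ACCELERATOR_COUNT = "cloud.google.com/gke-accelerator-count"


@dataclass
class NodePool:
    name: str
    cpu: int
    mem_gi: int
    gpus: int
    labels: dict = field(default_factory=dict)


def feasible_pools(pools, cpu, mem_gi, gpus, node_selector):
    """Names of pools that match every selector label and have enough resources."""
    return [
        p.name
        for p in pools
        if p.cpu >= cpu
        and p.mem_gi >= mem_gi
        and p.gpus >= gpus
        and all(p.labels.get(k) == v for k, v in node_selector.items())
    ]


# Hypothetical cluster with one CPU-only pool and two L4 pools.
pools = [
    NodePool("cpu-only", cpu=16, mem_gi=64, gpus=0),
    NodePool("l4-x1", cpu=4, mem_gi=16, gpus=1,
             labels={ACCELERATOR: "nvidia-l4", ACCELERATOR_COUNT: "1"}),
    NodePool("l4-x2", cpu=8, mem_gi=54, gpus=2,
             labels={ACCELERATOR: "nvidia-l4", ACCELERATOR_COUNT: "2"}),
]

l4_only = {ACCELERATOR: "nvidia-l4"}

# Asking for 2 GPUs: the GPU limit alone rules out every pool but one.
print(feasible_pools(pools, cpu=4, mem_gi=16, gpus=2, node_selector=l4_only))
# ['l4-x2']

# Asking for 1 GPU: both L4 pools fit, so the count label is what narrows it down.
print(feasible_pools(pools, cpu=2, mem_gi=8, gpus=1, node_selector=l4_only))
# ['l4-x1', 'l4-x2']
print(feasible_pools(pools, cpu=2, mem_gi=8, gpus=1,
                     node_selector={**l4_only, ACCELERATOR_COUNT: "1"}))
# ['l4-x1']
```

In other words, the count label is only needed when more than one pool could satisfy the request and you care which one is picked.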
