# flyte-support
b
Hi, I’m trying to get a step running on GPUs. I saw the recent update where you can use `from flytekit.extras.accelerators import T4` and then pass T4 to the task `accelerator` parameter to request it. However, I realized that that outputs the following for AWS. I’m on GCP, is there an equivalent for that? Or if not, how can I go about requesting GPUs on GKE? Thanks
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: k8s.amazonaws.com/accelerator
          operator: In
          values:
          - nvidia-tesla-t4
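(For context, a minimal sketch of the flytekit usage that produces affinity output like the above; the function name, task body, and resource request are illustrative, not from this thread:)

from flytekit import Resources, task
from flytekit.extras.accelerators import T4


@task(requests=Resources(gpu="1"), accelerator=T4)
def train_step() -> None:
    # GPU work goes here; the accelerator argument is what injects the
    # node affinity block shown above.
    ...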
So far, I’ve found this:
plugins:
  k8s:
    gpu-device-node-label: cloud.google.com/gke-accelerator
which I’m guessing goes somewhere in the values-gcp-core.yaml, but so far neither at the root nor under `configuration.inline` has worked.
f
This is a little new so not yet documented - Coming soon. Cc @freezing-boots-56761 / @many-wire-75890
b
Ok, gotcha, thanks. Is there a workaround or old method for using GPUs I could use in the meantime? Or @average-finland-92144, do you know if there’s a place in the terraform I could change so that node label becomes cloud.google.com/gke-accelerator? Thanks
a
@big-notebook-82371 adding the following to the GKE module (here) should add the label to the nodes: EDIT: still working on it
node_pools_labels = {
  all = { "cloud.google.com/gke-accelerator" = "nvidia-tesla-t4" }
  default-node-pool = {
    default-node-pool = true
  }
}
Just created an Issue, since more programmatic support for GPUs in the GCP modules is needed
b
Sounds good, I’ll check back. So, if I’m understanding right, this will add the label to the node pool itself. But this won’t allow me to choose a specific node pool (which has GPUs) inside of the `@task` decorator, is that right? Basically, right now it looks like when I do `@task(accelerator=T4)`, it adds `k8s.amazonaws.com/accelerator` to the node affinity section, like above. I think if something can be changed so that that becomes `cloud.google.com/gke-accelerator`, that would solve my issue. But maybe that would be a change in flytekit instead?
But any way I can utilize a GPU node pool with existing functionality will work; I just can’t get them to work at all right now
f
this is already supported.
from the PR:
plugins:
  k8s:
    gpu-device-node-label: cloud.google.com/gke-accelerator
    gpu-partition-size-node-label: cloud.google.com/gke-gpu-partition-size
    gpu-unpartitioned-toleration:
      effect: NoSchedule
      key: cloud.google.com/gke-gpu-partition-size
      operator: Equal
      value: DoesNotExist
docs are still WIP, but the PR should capture the various use cases for flyte administrators 😅
does that help @big-notebook-82371
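(For illustration, with that plugin configuration in place the injected node affinity for a T4 request should mirror the AWS example above, just keyed on the GKE label; this is a sketch rather than output captured from a cluster:)

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: cloud.google.com/gke-accelerator
          operator: In
          values:
          - nvidia-tesla-t4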
b
I did find that PR, but I’m not quite sure where to add that code, and where the “FlytePropeller k8s plugin configuration” is. I used the `deploy_flyte` terraform setup for GCP, can I place that code somewhere in the `values-gcp-core.yaml` file to take effect? I tried at the root and under `configuration.inline` with no luck
f
can you link the base values file you are using? should be trivial to add. @average-finland-92144: maybe you know off the top of your head where the k8s plugin goes in this values file?
f
ok, it’s using the `flyte-core` chart. this is the `k8s` plugin block in the base values file: https://github.com/flyteorg/flyte/blob/8cc422ebe5aa21aa20a75ca93362f27979941c64/charts/flyte-core/values.yaml#L728
you'll just have to add the block at the same level in your values file @big-notebook-82371. will look something like this:
k8s:
  plugins:
    k8s:
      gpu-device-node-label: cloud.google.com/gke-accelerator
      gpu-partition-size-node-label: cloud.google.com/gke-gpu-partition-size
      gpu-unpartitioned-toleration:
        effect: NoSchedule
        key: cloud.google.com/gke-gpu-partition-size
        operator: Equal
        value: DoesNotExist
b
Ok, awesome, I think I’m seeing how that works. Sorry I haven’t used terraform/helm very much. So it looks like I would put that section you just posted under the `flyteconsole` key?
Well, if it matches, it would be `flyteconsole.configmap.k8s.plugins.k8s.gpu-device-node-label`
a
thanks! @big-notebook-82371 that'd be under `configmap`
b
oh ok, just the root level configmap?
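(For reference, assuming the flyte-core chart that the deploy_flyte terraform uses, the resolved placement in values-gcp-core.yaml would look roughly like this, with the plugin block nested under the root-level configmap key:)

configmap:
  k8s:
    plugins:
      k8s:
        gpu-device-node-label: cloud.google.com/gke-accelerator
        gpu-partition-size-node-label: cloud.google.com/gke-gpu-partition-size
        gpu-unpartitioned-toleration:
          effect: NoSchedule
          key: cloud.google.com/gke-gpu-partition-size
          operator: Equal
          value: DoesNotExist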
b
Ok, perfect. I’ll give that a try! Thank you @freezing-boots-56761!
f
👍
b
@freezing-boots-56761 quick follow-up question. I just got this error on a pod with a GPU:
Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm)
And I see on that MR:
We plan on leveraging ExtendedResources for other specialized resources such as shared memory (/dev/shm) in the future.
Is there a workaround for shm right now? Maybe using a V1PodSpec or something?
f
yea you can use a pod template on the task decorator for now
needs a volume in the pod spec and volume mount in the container spec. i’ll try and find an example
b
awesome, thank you
b
perfect, I’ll give that a try. Thank you!
Looks like I got the same error... any debugging tips? Here’s my setup:
gpu_pod_template = PodTemplate(
    primary_container_name="primary",
    pod_spec=V1PodSpec(
        containers=[
            V1Container(
                name="primary",
                volume_mounts=[V1VolumeMount(mount_path="/dev/shm", name="dshm")],
            ),
        ],
        volumes=[
            V1Volume(
                name="dshm",
                empty_dir=V1EmptyDirVolumeSource(medium="", size_limit="500Gi"),
            )
        ],
    ),
)


@task(
    container_image=image_spec,
    environment={"PYTHONPATH": "/root"},
    requests=requests,
    limits=limits,
    accelerator=T4,
    pod_template=gpu_pod_template,
)
And here are the sections from the pod spec yaml. Does the size limit have to match my memory requests for the machine? And does the primary_container_name need to be something different?
f
hmm try setting medium to “Memory”
and remove the size_limit
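(For reference, a sketch of the adjusted pod template with those two changes applied; the container/volume names and the /dev/shm mount are carried over from the snippet above, and the imports assume the standard kubernetes Python client models:)

from flytekit import PodTemplate
from kubernetes.client import (
    V1Container,
    V1EmptyDirVolumeSource,
    V1PodSpec,
    V1Volume,
    V1VolumeMount,
)

gpu_pod_template = PodTemplate(
    primary_container_name="primary",
    pod_spec=V1PodSpec(
        containers=[
            V1Container(
                name="primary",
                volume_mounts=[V1VolumeMount(mount_path="/dev/shm", name="dshm")],
            ),
        ],
        volumes=[
            V1Volume(
                name="dshm",
                # medium="Memory" backs the emptyDir with tmpfs; size_limit is
                # omitted, per the suggestion above.
                empty_dir=V1EmptyDirVolumeSource(medium="Memory"),
            )
        ],
    ),
)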
b
I think that worked!
n
@big-notebook-82371 these clips of pod_spec.yaml - where are they from - where do i put them?
b
Those were me viewing the yaml file of a running pod, not something that I specified. It’s the result of the flyte code, if that makes sense
n
thx!