#ask-the-community

Andrew

12/09/2023, 3:51 PM
Hi, I’m trying to get a step running on GPUs. I saw the recent update where you can use from flytekit.extras.accelerators import T4, and then pass T4 to the task’s accelerator parameter to request it. However, I realized that it outputs the following, which is for AWS. I’m on GCP, is there an equivalent for that? Or if not, how can I go about requesting GPUs on GKE? Thanks
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: k8s.amazonaws.com/accelerator
          operator: In
          values:
          - nvidia-tesla-t4
So far, I’ve found this:
plugins:
  k8s:
    gpu-device-node-label: cloud.google.com/gke-accelerator
which I’m guessing goes somewhere in the values-gcp-core.yaml, but so far neither placing it at the root nor under configuration.inline has worked.

Ketan (kumare3)

12/11/2023, 4:28 PM
This is a little new so not yet documented - Coming soon. Cc @jeev / @James Sutton

Andrew

12/11/2023, 5:10 PM
Ok, gotcha, thanks. Is there a workaround or old method for using GPUs I could use in the meantime? Or @David Espejo (he/him) do you know if there’s a place in the terraform I could change to set that node label to cloud.google.com/gke-accelerator? Thanks

David Espejo (he/him)

12/11/2023, 8:39 PM
@Andrew adding the following to the GKE module (here) should add the label to the nodes: EDIT: still working on it
node_pools_labels = {
  all = { "cloud.google.com/gke-accelerator" = "nvidia-tesla-t4" }
  default-node-pool = {
    default-node-pool = true
  }
}
Just created an Issue as adding more programmatic support for GPUs in the GCP modules is needed

Andrew

12/11/2023, 8:52 PM
Sounds good, I’ll check back. So, if I’m understanding right, this will add the label to the node pool itself. But this won’t allow me to choose a specific node pool (which has GPUs) inside of the @task decorator, is that right? Basically, right now it looks like when I do @task(accelerator=T4), it adds k8s.amazonaws.com/accelerator to the node affinity section, like above. I think if something can be changed so that it becomes cloud.google.com/gke-accelerator, that would solve my issue. But maybe that would be a change in flytekit instead?
But any way I can utilize a GPU node pool with existing functionality will work, I just can’t get them to work at all right now

jeev

12/11/2023, 8:54 PM
this is already supported.
from the PR:
plugins:
  k8s:
    gpu-device-node-label: cloud.google.com/gke-accelerator
    gpu-partition-size-node-label: cloud.google.com/gke-gpu-partition-size
    gpu-unpartitioned-toleration:
      effect: NoSchedule
      key: cloud.google.com/gke-gpu-partition-size
      operator: Equal
      value: DoesNotExist
docs are still WIP, but the PR should capture the various use cases for flyte administrators 😅
does that help @Andrew
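To make the effect of gpu-device-node-label concrete, here is a minimal Python model of how the node affinity from the first message would change once the label is overridden. This is only an illustration, not FlytePropeller's actual code: the function name gpu_node_affinity is hypothetical, and the default label is assumed from the pod spec Andrew posted.

```python
from typing import Optional

# Assumed default, taken from the pod spec shown earlier in the thread.
DEFAULT_GPU_DEVICE_NODE_LABEL = "k8s.amazonaws.com/accelerator"


def gpu_node_affinity(device: str, k8s_plugin_config: Optional[dict] = None) -> dict:
    """Hypothetical helper: build the affinity block for a requested accelerator."""
    config = k8s_plugin_config or {}
    # The configured label replaces the default key in the match expression.
    label = config.get("gpu-device-node-label", DEFAULT_GPU_DEVICE_NODE_LABEL)
    return {
        "nodeAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": {
                "nodeSelectorTerms": [
                    {
                        "matchExpressions": [
                            {"key": label, "operator": "In", "values": [device]}
                        ]
                    }
                ]
            }
        }
    }


# With the GKE override, the same T4 request targets the GKE node label.
gke_config = {"gpu-device-node-label": "cloud.google.com/gke-accelerator"}
gke_affinity = gpu_node_affinity("nvidia-tesla-t4", gke_config)
aws_affinity = gpu_node_affinity("nvidia-tesla-t4")
```

In this model the task code stays the same (accelerator=T4); only the platform-side configuration decides which node label is matched.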

Andrew

12/11/2023, 8:58 PM
I did find that PR, but I’m not quite sure where to add that code, or where the “FlytePropeller k8s plugin configuration” is. I used the deploy_flyte terraform setup for GCP. Can I place that code somewhere in the values-gcp-core.yaml file for it to take effect? I tried at the root and under configuration.inline with no luck

jeev

12/11/2023, 9:00 PM
can you link the base values file you are using? should be trivial to add. @David Espejo (he/him): maybe you know off the top of your head where the k8s plugin goes in this values file?

jeev

12/11/2023, 9:02 PM
ok it's using the flyte-core chart. this is the k8s plugin block in the base values file: https://github.com/flyteorg/flyte/blob/8cc422ebe5aa21aa20a75ca93362f27979941c64/charts/flyte-core/values.yaml#L728
you'll just have to add the block at the same level in your values file @Andrew. will look something like this:
k8s:
  plugins:
    k8s:
      gpu-device-node-label: cloud.google.com/gke-accelerator
      gpu-partition-size-node-label: cloud.google.com/gke-gpu-partition-size
      gpu-unpartitioned-toleration:
        effect: NoSchedule
        key: cloud.google.com/gke-gpu-partition-size
        operator: Equal
        value: DoesNotExist

Andrew

12/11/2023, 9:07 PM
Ok, awesome, I think I’m seeing how that works. Sorry, I haven’t used terraform/helm very much. So it looks like I would put that section you just posted under the flyteconsole key?
Well, if it matches, it would be flyteconsole.configmap.k8s.plugins.k8s.gpu-device-node-label

David Espejo (he/him)

12/11/2023, 9:08 PM
thanks! @Andrew that'd be under configmap
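Combining jeev's block with the configmap key, the full path in the values file would be configmap.k8s.plugins.k8s. The Python sketch below only mirrors that nesting as a dict so the key path is unambiguous; the names gke_gpu_values and lookup are illustrative, and the real override still has to be written as YAML in values-gcp-core.yaml.

```python
from functools import reduce

# Mirror of the Helm values override, nested under the root-level configmap key.
gke_gpu_values = {
    "configmap": {
        "k8s": {
            "plugins": {
                "k8s": {
                    "gpu-device-node-label": "cloud.google.com/gke-accelerator",
                    "gpu-partition-size-node-label": "cloud.google.com/gke-gpu-partition-size",
                    "gpu-unpartitioned-toleration": {
                        "effect": "NoSchedule",
                        "key": "cloud.google.com/gke-gpu-partition-size",
                        "operator": "Equal",
                        "value": "DoesNotExist",
                    },
                }
            }
        }
    }
}


def lookup(values: dict, dotted_path: str):
    """Walk a nested dict using a dot-separated key path."""
    return reduce(lambda node, key: node[key], dotted_path.split("."), values)


print(lookup(gke_gpu_values, "configmap.k8s.plugins.k8s.gpu-device-node-label"))
# prints: cloud.google.com/gke-accelerator
```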

Andrew

12/11/2023, 9:09 PM
oh ok, just the root level configmap?

Andrew

12/11/2023, 9:10 PM
Ok, perfect. I’ll give that a try! Thank you @jeev!

jeev

12/11/2023, 9:14 PM
👍

Andrew

12/13/2023, 3:14 PM
@jeev quick follow up question. I just got this error on a pod with a gpu:
Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm)
And I see on that MR:
We plan on leveraging ExtendedResources for other specialized resources such as shared memory (/dev/shm) in the future.
Is there a workaround for shm right now? Maybe using a V1PodSpec or something?

jeev

12/13/2023, 3:15 PM
yea you can use a pod template on the task decorator for now
needs a volume in the pod spec and volume mount in the container spec. i’ll try and find an example

Andrew

12/13/2023, 3:16 PM
awesome, thank you

Andrew

12/13/2023, 3:19 PM
perfect, I’ll give that a try. Thank you!
Looks like I got the same error... Any debugging tips? Here’s my setup:
gpu_pod_template = PodTemplate(
    primary_container_name="primary",
    pod_spec=V1PodSpec(
        containers=[
            V1Container(
                name="primary",
                volume_mounts=[V1VolumeMount(mount_path="/dev/shm", name="dshm")],
            ),
        ],
        volumes=[
            V1Volume(
                name="dshm",
                empty_dir=V1EmptyDirVolumeSource(medium="", size_limit="500Gi"),
            )
        ],
    ),
)


@task(
    container_image=image_spec,
    environment={"PYTHONPATH": "/root"},
    requests=requests,
    limits=limits,
    accelerator=T4,
    pod_template=gpu_pod_template,
)
And here are the sections from the pod spec yaml. Does the size limit have to match my memory requests for the machine? And does the primary_container_name need to be something different?

jeev

12/13/2023, 4:18 PM
hmm try setting medium to “Memory”
and remove the size_limit

Andrew

12/13/2023, 4:37 PM
I think that worked!

Alex Lyashok

01/11/2024, 9:22 PM
@Andrew these clips of pod_spec.yaml - where are they from - where do i put them?

Andrew

01/11/2024, 9:24 PM
Those were from me viewing the YAML of a running pod, not something that I specified. It's the result of the flyte code, if that makes sense

Alex Lyashok

01/11/2024, 9:25 PM
thx!