# flyte-support
b
Hi, I’m trying to get a step running on GPUs. I saw the recent update where you can use `from flytekit.extras.accelerators import T4` and then pass T4 to the task `accelerator` parameter to request it. However, I realized that that outputs the following for AWS. I’m on GCP, is there an equivalent for that? Or if not, how can I go about requesting GPUs on GKE? Thanks
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: k8s.amazonaws.com/accelerator
          operator: In
          values:
          - nvidia-tesla-t4
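(For context, a minimal sketch of the flytekit usage that produces affinity output like the above; the function name, task body, and resource request are illustrative, not from this thread:)

from flytekit import Resources, task
from flytekit.extras.accelerators import T4


@task(requests=Resources(gpu="1"), accelerator=T4)
def train_step() -> None:
    # GPU work goes here; the accelerator argument is what injects the
    # node affinity block shown above.
    ...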
So far, I’ve found this:
plugins:
  k8s:
    gpu-device-node-label: cloud.google.com/gke-accelerator
which I’m guessing goes somewhere in the values-gcp-core.yaml, but so far neither at the root nor under `configuration.inline` has worked.
f
This is a little new so not yet documented - Coming soon. Cc @freezing-boots-56761 / @many-wire-75890
b
Ok, gotcha, thanks. Is there a workaround or old method for using GPUs I could use in the meantime? Or @average-finland-92144, do you know if there’s a place in the terraform I could change so that node label becomes cloud.google.com/gke-accelerator? Thanks
a
@big-notebook-82371 adding the following to the GKE module (here) should add the label to the nodes: EDIT: still working on it
node_pools_labels = {
  all = { "cloud.google.com/gke-accelerator" = "nvidia-tesla-t4" }
  default-node-pool = {
    default-node-pool = true
  }
}
Just created an Issue, since more programmatic support for GPUs in the GCP modules is needed
b
Sounds good, I’ll check back. So, if I’m understanding right, this will add the label to the node pool itself. But this won’t allow me to choose a specific node pool (which has GPUs) inside of the `@task` decorator, is that right? Basically, right now it looks like when I do `@task(accelerator=T4)`, it adds `k8s.amazonaws.com/accelerator` to the node affinity section, like above. I think if something can be changed so that that becomes `cloud.google.com/gke-accelerator`, that would solve my issue. But maybe that would be a change in flytekit instead?
But any way I can utilize a GPU node pool with existing functionality will work; I just can’t get them to work at all right now
f
this is already supported.
from the PR:
plugins:
  k8s:
    gpu-device-node-label: cloud.google.com/gke-accelerator
    gpu-partition-size-node-label: cloud.google.com/gke-gpu-partition-size
    gpu-unpartitioned-toleration:
      effect: NoSchedule
      key: cloud.google.com/gke-gpu-partition-size
      operator: Equal
      value: DoesNotExist
docs are still WIP, but the PR should capture the various use cases for flyte administrators 😅
does that help @big-notebook-82371
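(For illustration, with that plugin configuration in place the injected node affinity for a T4 request should mirror the AWS example above, just keyed on the GKE label; this is a sketch rather than output captured from a cluster:)

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: cloud.google.com/gke-accelerator
          operator: In
          values:
          - nvidia-tesla-t4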
b
I did find that PR, but I’m not quite sure where to add that code, and where the “FlytePropeller k8s plugin configuration” is. I used the `deploy_flyte` terraform setup for GCP, can I place that code somewhere in the `values-gcp-core.yaml` file to take effect? I tried at the root and under `configuration.inline` with no luck
f
can you link the base values file you are using? should be trivial to add. @average-finland-92144: maybe you know off the top of your head where the k8s plugin goes in this values file?
f
ok, it’s using the `flyte-core` chart. this is the `k8s` plugin block in the base values file: https://github.com/flyteorg/flyte/blob/8cc422ebe5aa21aa20a75ca93362f27979941c64/charts/flyte-core/values.yaml#L728
you'll just have to add the block at the same level in your values file @big-notebook-82371. will look something like this:
k8s:
  plugins:
    k8s:
      gpu-device-node-label: cloud.google.com/gke-accelerator
      gpu-partition-size-node-label: cloud.google.com/gke-gpu-partition-size
      gpu-unpartitioned-toleration:
        effect: NoSchedule
        key: cloud.google.com/gke-gpu-partition-size
        operator: Equal
        value: DoesNotExist
b
Ok, awesome, I think I’m seeing how that works. Sorry I haven’t used terraform/helm very much. So it looks like I would put that section you just posted under the `flyteconsole` key?
Well, if it matches, it would be `flyteconsole.configmap.k8s.plugins.k8s.gpu-device-node-label`
a
thanks! @big-notebook-82371 that'd be under `configmap`
b
oh ok, just the root level configmap?
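(For reference, assuming the flyte-core chart that the deploy_flyte terraform uses, the resolved placement in values-gcp-core.yaml would look roughly like this, with the plugin block nested under the root-level configmap key:)

configmap:
  k8s:
    plugins:
      k8s:
        gpu-device-node-label: cloud.google.com/gke-accelerator
        gpu-partition-size-node-label: cloud.google.com/gke-gpu-partition-size
        gpu-unpartitioned-toleration:
          effect: NoSchedule
          key: cloud.google.com/gke-gpu-partition-size
          operator: Equal
          value: DoesNotExist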
b
Ok, perfect. I’ll give that a try! Thank you @freezing-boots-56761!
f
👍
b
@freezing-boots-56761 quick follow-up question. I just got this error on a pod with a GPU:
Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm)
And I see on that MR:
We plan on leveraging ExtendedResources for other specialized resources such as shared memory (/dev/shm) in the future.
Is there a workaround for shm right now? Maybe using a V1PodSpec or something?
f
yea you can use a pod template on the task decorator for now
needs a volume in the pod spec and volume mount in the container spec. i’ll try and find an example
b
awesome, thank you
b
perfect, I’ll give that a try. Thank you!
Looks like I got the same error... any debugging tips? Here’s my setup:
gpu_pod_template = PodTemplate(
    primary_container_name="primary",
    pod_spec=V1PodSpec(
        containers=[
            V1Container(
                name="primary",
                volume_mounts=[V1VolumeMount(mount_path="/dev/shm", name="dshm")],
            ),
        ],
        volumes=[
            V1Volume(
                name="dshm",
                empty_dir=V1EmptyDirVolumeSource(medium="", size_limit="500Gi"),
            )
        ],
    ),
)


@task(
    container_image=image_spec,
    environment={"PYTHONPATH": "/root"},
    requests=requests,
    limits=limits,
    accelerator=T4,
    pod_template=gpu_pod_template,
)
And here are the sections from the pod spec yaml. Does the size limit have to match my memory requests for the machine? And does the primary_container_name need to be something different?
f
hmm try setting medium to “Memory”
and remove the size_limit
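(For reference, a sketch of the adjusted pod template with those two changes applied; the container/volume names and the /dev/shm mount are carried over from the snippet above, and the imports assume the standard kubernetes Python client models:)

from flytekit import PodTemplate
from kubernetes.client import (
    V1Container,
    V1EmptyDirVolumeSource,
    V1PodSpec,
    V1Volume,
    V1VolumeMount,
)

gpu_pod_template = PodTemplate(
    primary_container_name="primary",
    pod_spec=V1PodSpec(
        containers=[
            V1Container(
                name="primary",
                volume_mounts=[V1VolumeMount(mount_path="/dev/shm", name="dshm")],
            ),
        ],
        volumes=[
            V1Volume(
                name="dshm",
                # medium="Memory" backs the emptyDir with tmpfs; size_limit is
                # omitted, per the suggestion above.
                empty_dir=V1EmptyDirVolumeSource(medium="Memory"),
            )
        ],
    ),
)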
b
I think that worked!
n
@big-notebook-82371 these clips of pod_spec.yaml - where are they from - where do i put them?
b
Those were me viewing the yaml file of a running pod, not something that I specified. It’s the result of the flyte code, if that makes sense
n
thx!