big-notebook-82371
12/09/2023, 3:51 PM
I saw that I can use from flytekit.extras.accelerators import T4 and then pass T4 to the task's accelerator parameter to request it. However, I realized that it produces the following output for AWS. I'm on GCP; is there an equivalent for that? Or if not, how can I go about requesting GPUs on GKE? Thanks
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: k8s.amazonaws.com/accelerator
              operator: In
              values:
                - nvidia-tesla-t4
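For context, the flytekit pattern being discussed looks roughly like this (a sketch; the function body and resource values are placeholders, not taken from the thread):

from flytekit import Resources, task
from flytekit.extras.accelerators import T4

@task(
    requests=Resources(gpu="1", mem="8Gi"),  # placeholder resource values
    accelerator=T4,  # this is what emits the accelerator node affinity shown above
)
def train() -> None:
    ...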
big-notebook-82371
12/10/2023, 5:53 AM
plugins:
  k8s:
    gpu-device-node-label: cloud.google.com/gke-accelerator
which I'm guessing goes somewhere in values-gcp-core.yaml, but so far placing it at the root or under configuration.inline has not worked.
freezing-airport-6809
big-notebook-82371
12/11/2023, 5:10 PM
cloud.google.com/gke-accelerator
Thanks
average-finland-92144
12/11/2023, 8:39 PM
node_pools_labels = {
  all = { "cloud.google.com/gke-accelerator" = "nvidia-tesla-t4" }
  default-node-pool = {
    default-node-pool = true
  }
}
Just created an issue, as adding more programmatic support for GPUs in the GCP modules is needed.
big-notebook-82371
12/11/2023, 8:52 PM
The @task decorator, is that right?
Basically, right now it looks like when I do @task(accelerator=T4), it adds k8s.amazonaws.com/accelerator to the node affinity section, like above. I think if that could be changed to cloud.google.com/gke-accelerator instead, that would solve my issue.
But maybe that would be a change in flytekit instead?
big-notebook-82371
12/11/2023, 8:52 PM
freezing-boots-56761
plugins:
  k8s:
    gpu-device-node-label: cloud.google.com/gke-accelerator
    gpu-partition-size-node-label: cloud.google.com/gke-gpu-partition-size
    gpu-unpartitioned-toleration:
      effect: NoSchedule
      key: cloud.google.com/gke-gpu-partition-size
      operator: Equal
      value: DoesNotExist
freezing-boots-56761
big-notebook-82371
12/11/2023, 8:58 PM
With the deploy_flyte Terraform setup for GCP, can I place that code somewhere in the values-gcp-core.yaml file to take effect? I tried at the root and under configuration.inline with no luck.
freezing-boots-56761
big-notebook-82371
12/11/2023, 9:00 PM
freezing-boots-56761
It's the flyte-core chart. This is the k8s plugin block in the base values file: https://github.com/flyteorg/flyte/blob/8cc422ebe5aa21aa20a75ca93362f27979941c64/charts/flyte-core/values.yaml#L728
freezing-boots-56761
k8s:
  plugins:
    k8s:
      gpu-device-node-label: cloud.google.com/gke-accelerator
      gpu-partition-size-node-label: cloud.google.com/gke-gpu-partition-size
      gpu-unpartitioned-toleration:
        effect: NoSchedule
        key: cloud.google.com/gke-gpu-partition-size
        operator: Equal
        value: DoesNotExist
big-notebook-82371
12/11/2023, 9:07 PM
Under the flyteconsole key?
big-notebook-82371
12/11/2023, 9:08 PM
flyteconsole.configmap.k8s.plugins.k8s.gpu-device-node-label
average-finland-92144
12/11/2023, 9:08 PM
configmap
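Putting those two replies together, the nesting in the Helm values would presumably look like this (a sketch assembled from the flyte-core base values file linked above and the plugin block shown earlier; not verified against the deploy_flyte values-gcp-core.yaml layout):

configmap:
  k8s:
    plugins:
      k8s:
        gpu-device-node-label: cloud.google.com/gke-accelerator
        gpu-partition-size-node-label: cloud.google.com/gke-gpu-partition-size
        gpu-unpartitioned-toleration:
          effect: NoSchedule
          key: cloud.google.com/gke-gpu-partition-size
          operator: Equal
          value: DoesNotExist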
big-notebook-82371
12/11/2023, 9:09 PM
average-finland-92144
12/11/2023, 9:09 PM
big-notebook-82371
12/11/2023, 9:10 PM
freezing-boots-56761
big-notebook-82371
12/13/2023, 3:14 PM
Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm)
And I see on that MR: "We plan on leveraging ExtendedResources for other specialized resources such as shared memory (/dev/shm) in the future."
Is there a workaround for shm right now? Maybe using a V1PodSpec or something?
freezing-boots-56761
big-notebook-82371
12/13/2023, 3:16 PM
freezing-boots-56761
big-notebook-82371
12/13/2023, 3:19 PM
big-notebook-82371
12/13/2023, 4:08 PM
gpu_pod_template = PodTemplate(
    primary_container_name="primary",
    pod_spec=V1PodSpec(
        containers=[
            V1Container(
                name="primary",
                volume_mounts=[V1VolumeMount(mount_path="/dev/shm", name="dshm")],
            ),
        ],
        volumes=[
            V1Volume(
                name="dshm",
                empty_dir=V1EmptyDirVolumeSource(medium="", size_limit="500Gi"),
            )
        ],
    ),
)

@task(
    container_image=image_spec,
    environment={"PYTHONPATH": "/root"},
    requests=requests,
    limits=limits,
    accelerator=T4,
    pod_template=gpu_pod_template,
)
And here are the sections from the pod spec yaml. Does the size limit have to match my memory requests for the machine? And does the primary_container_name need to be something different?
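A common Kubernetes workaround for this shm error is a memory-backed emptyDir mounted at /dev/shm. The sketch below shows that pattern under stated assumptions (it is not a configuration confirmed in this thread, and the 16Gi size is a placeholder); with medium="Memory" the volume is tmpfs, and writes to it count against the container's memory limit, so size_limit should fit within the task's memory request rather than match total machine memory.

from kubernetes.client import (
    V1Container,
    V1EmptyDirVolumeSource,
    V1PodSpec,
    V1Volume,
    V1VolumeMount,
)
from flytekit import PodTemplate

# Sketch: memory-backed emptyDir mounted at /dev/shm.
# medium="Memory" makes the volume tmpfs; its usage counts against the
# container's memory limit, so keep size_limit within the task's mem request.
shm_pod_template = PodTemplate(
    primary_container_name="primary",  # same container name as in the template above
    pod_spec=V1PodSpec(
        containers=[
            V1Container(
                name="primary",
                volume_mounts=[V1VolumeMount(mount_path="/dev/shm", name="dshm")],
            ),
        ],
        volumes=[
            V1Volume(
                name="dshm",
                empty_dir=V1EmptyDirVolumeSource(medium="Memory", size_limit="16Gi"),  # placeholder size
            )
        ],
    ),
)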
freezing-boots-56761
big-notebook-82371
12/13/2023, 4:37 PM
numerous-sunset-21589
01/11/2024, 9:22 PM
big-notebook-82371
01/11/2024, 9:24 PM
numerous-sunset-21589
01/11/2024, 9:25 PM