Hi, how can I run a ContainerTask on a GPU enabled...
# ask-the-community
Felix Ruess:
Hi, how can I run a ContainerTask on a GPU-enabled node and set runtimeClassName to nvidia so that it can actually use the GPU? I added requests=Resources(gpu="1"), but can I also add the runtimeClassName?
Ketan (kumare3):
@Felix Ruess this is interesting. We do not support runtimeClassName at the "container" level, but there are two options:
1. Set a default runtime class name using pod templates. These are the templates used to run all container tasks ("raw" or otherwise).
2. Use the Pod plugin.
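A minimal sketch of option 2 for a regular Python task (the original question is about a raw ContainerTask, where this may not apply), assuming flytekitplugins-pod and the kubernetes Python client are installed; the task name and the nvidia runtime class value are illustrative:

from flytekit import Resources, task
from flytekitplugins.pod import Pod
from kubernetes.client.models import V1Container, V1PodSpec

@task(
    task_config=Pod(
        # Supplying the full pod spec lets you set pod-level fields such as runtimeClassName.
        pod_spec=V1PodSpec(
            runtime_class_name="nvidia",
            containers=[V1Container(name="primary")],
        ),
        primary_container_name="primary",
    ),
    requests=Resources(gpu="1"),
    limits=Resources(gpu="1"),
)
def train_on_gpu() -> None:
    ...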
Maarten de Jong:
We did it in the following way; not sure if it's the most ideal (since you don't get to pick a specific GPU type), but it seems to work for us:
from typing import Any, Optional

from flytekit import ContainerTask, Resources
from flytekit.configuration import SerializationSettings
from flytekit.core.utils import _get_container_definition
from flytekit.models import task as _task_model


class CustomContainerTask(ContainerTask):
    def __init__(
        self,
        requests: Optional[Resources] = None,
        limits: Optional[Resources] = None,
        **kwargs: Any,
    ):
        super().__init__(
            <<stuff>>  # other ContainerTask arguments elided
            requests=requests,
            limits=limits,
            **kwargs,
        )

    def get_container(self, settings: SerializationSettings) -> _task_model.Container:
        env = {**settings.env, **self.environment} if self.environment else settings.env
        return _get_container_definition(
            image=self._image,
            command=self._cmd,
            args=self._args,
            data_loading_config=_task_model.DataLoadingConfig(
                input_path=self._input_data_dir,
                output_path=self._output_data_dir,
                format=self._md_format.value,
                enabled=True,
                io_strategy=self._io_strategy.value if self._io_strategy else None,
            ),
            environment=env,
            cpu_request=self.resources.requests.cpu,
            cpu_limit=self.resources.limits.cpu,
            memory_request=self.resources.requests.mem,
            memory_limit=self.resources.limits.mem,
            # The important part: pass the GPU resources through, which the stock
            # ContainerTask.get_container does not do.
            gpu_request=self.resources.requests.gpu,
            gpu_limit=self.resources.limits.gpu,
            ephemeral_storage_request=self.resources.requests.ephemeral_storage,
            ephemeral_storage_limit=self.resources.limits.ephemeral_storage,
        )


pod_resources = Resources(cpu="3", mem="20Gi", gpu="1")
export_terrain_texture_container_task = CustomContainerTask(
    requests=pod_resources,
    limits=pod_resources,
)
Where IMO the important part is the override of get_container, since the original does not pass the GPU resources through, i.e. these lines:
gpu_request=self.resources.requests.gpu,
gpu_limit=self.resources.limits.gpu,
Felix Ruess:
@Maarten de Jong thanks for the hint about passing on the GPU resources; that does sound like a bug though.
@Ketan (kumare3) I'll look into pod templates, but what I need in the end is that when I request a GPU, the limit nvidia.com/gpu: "1" and runtimeClassName: nvidia are added to the container, and if I don't request a GPU, neither is.
Ketan (kumare3):
@Felix Ruess also, do propose a better UX. Remember one goal: we want to keep most tasks simple, without needing Kubernetes pod specifics. This makes it possible to do fun optimizations (hopefully next year).
Felix Ruess:
Should I open an issue for the workaround that @Maarten de Jong posted? IMHO this really should not be needed and is a bug.
How is translating a GPU request into Kubernetes resource requests, tolerations, etc. currently done? I didn't get how a GPU request is turned into nvidia.com/gpu, or is that "hardcoded" in favor of NVIDIA GPUs? Regarding tolerations, I already set up the ExtendedResourceToleration admission controller, which adds the tolerations automatically, so in my case I don't have to handle that in Flyte separately. One other idea for runtimeClassName could be to set up something similar that adds it automatically in k8s, basically a customized ExtendedResourceToleration admission controller or an additional one...
@Maarten de Jong @Ketan (kumare3) https://github.com/flyteorg/flytekit/pull/1249
@Ketan (kumare3) thanks for the pointers! I got it to work with the above fix for flytekit and a custom PodTemplate that sets runtimeClassName: nvidia.
This works for now and I can run GPU tasks, but it also means that Flyte tasks will always run with the nvidia runtime (and hence will only be scheduled to nodes that have this runtime, and hence a GPU).
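For reference, a minimal sketch of such a default PodTemplate, assuming flytepropeller is pointed at it via the k8s plugin's default-pod-template-name setting; the template name, namespace, and placeholder image are illustrative:

apiVersion: v1
kind: PodTemplate
metadata:
  name: flyte-gpu-template
  namespace: flyte
template:
  spec:
    runtimeClassName: nvidia
    containers:
      - name: default                        # base container definition merged into task pods
        image: docker.io/rwgrim/docker-noop  # placeholder; the actual task image is used at runtime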
Ketan (kumare3):
Hmm, yeah, this is not good. Cc @Sören Brunk, @Felix Ruess, and @Alireza: we should file an issue for this. This definitely seems like something we can do in the higher-level container API to make things better.