Hi, how can I run a ContainerTask on a GPU enabled...
# ask-the-community
Felix Ruess:
Hi, how can I run a ContainerTask on a GPU-enabled node and set runtimeClassName to nvidia so that it can actually use the GPU? I added requests=Resources(gpu="1"), but can I also add the runtimeClassName?
Ketan (kumare3):
@Felix Ruess this is interesting. We do not support runtimeClassName at the "container" level, but there are two options:
1. Set a default runtime class name using pod templates. These are the templates used to run all container tasks ("raw" or otherwise).
2. Use the Pod plugin.
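A minimal sketch of option 2 for a regular Python task (the original question is about a raw ContainerTask, where this may not apply), assuming flytekitplugins-pod and the kubernetes Python client are installed; the task name and the nvidia runtime class value are illustrative:

from flytekit import Resources, task
from flytekitplugins.pod import Pod
from kubernetes.client.models import V1Container, V1PodSpec

@task(
    task_config=Pod(
        # Supplying the full pod spec lets you set pod-level fields such as runtimeClassName.
        pod_spec=V1PodSpec(
            runtime_class_name="nvidia",
            containers=[V1Container(name="primary")],
        ),
        primary_container_name="primary",
    ),
    requests=Resources(gpu="1"),
    limits=Resources(gpu="1"),
)
def train_on_gpu() -> None:
    ...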
Maarten de Jong:
We did it in the following way; not sure if it's the most ideal (since you don't get to pick a specific GPU type), but it seems to work for us:
from typing import Any, Optional

from flytekit import ContainerTask, Resources
from flytekit.configuration import SerializationSettings
from flytekit.core.utils import _get_container_definition
from flytekit.models import task as _task_model


class CustomContainerTask(ContainerTask):
    def __init__(
        self,
        requests: Optional[Resources] = None,
        limits: Optional[Resources] = None,
        **kwargs: Any,
    ):
        super().__init__(
            <<stuff>>  # other ContainerTask arguments elided
            requests=requests,
            limits=limits,
            **kwargs,
        )

    def get_container(self, settings: SerializationSettings) -> _task_model.Container:
        env = {**settings.env, **self.environment} if self.environment else settings.env
        return _get_container_definition(
            image=self._image,
            command=self._cmd,
            args=self._args,
            data_loading_config=_task_model.DataLoadingConfig(
                input_path=self._input_data_dir,
                output_path=self._output_data_dir,
                format=self._md_format.value,
                enabled=True,
                io_strategy=self._io_strategy.value if self._io_strategy else None,
            ),
            environment=env,
            cpu_request=self.resources.requests.cpu,
            cpu_limit=self.resources.limits.cpu,
            memory_request=self.resources.requests.mem,
            memory_limit=self.resources.limits.mem,
            # The important part: pass the GPU resources through, which the stock
            # ContainerTask.get_container does not do.
            gpu_request=self.resources.requests.gpu,
            gpu_limit=self.resources.limits.gpu,
            ephemeral_storage_request=self.resources.requests.ephemeral_storage,
            ephemeral_storage_limit=self.resources.limits.ephemeral_storage,
        )


pod_resources = Resources(cpu="3", mem="20Gi", gpu="1")
export_terrain_texture_container_task = CustomContainerTask(
    requests=pod_resources,
    limits=pod_resources,
)
Where IMO the important part is the override of get_container, since the original does not pass the GPU resources through, i.e. these lines:
gpu_request=self.resources.requests.gpu,
gpu_limit=self.resources.limits.gpu,
Felix Ruess:
@Maarten de Jong thanks for the hint about passing on the GPU resources; that does sound like a bug though.
@Ketan (kumare3) I'll look into pod templates, but what I need in the end is that when I request a GPU, the limit nvidia.com/gpu: "1" and runtimeClassName: nvidia are added to the container, and if I don't request a GPU, neither is.
Ketan (kumare3):
@Felix Ruess also, do propose a better UX. Remember one goal: we want to keep most tasks simple, without needing Kubernetes pod specifics. This makes it possible to do fun optimizations (hopefully next year).
Felix Ruess:
Should I open an issue for the workaround that @Maarten de Jong posted? IMHO this really should not be needed and is a bug.
How is translating a GPU request into Kubernetes resource requests, tolerations, etc. currently done? I didn't get how a GPU request is turned into nvidia.com/gpu, or is that "hardcoded" in favor of NVIDIA GPUs? Regarding tolerations, I already set up the ExtendedResourceToleration admission controller, which adds the tolerations automatically, so in my case I don't have to handle that in Flyte separately. One other idea for runtimeClassName could be to set up something similar that adds it automatically in k8s, basically a customized ExtendedResourceToleration admission controller or an additional one...
@Maarten de Jong @Ketan (kumare3) https://github.com/flyteorg/flytekit/pull/1249
@Ketan (kumare3) thanks for the pointers! I got it to work with the above fix for flytekit and a custom PodTemplate that sets runtimeClassName: nvidia.
This works for now and I can run GPU tasks, but it also means that Flyte tasks will always run with the nvidia runtime (and hence will only be scheduled to nodes that have this runtime, and hence a GPU).
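For reference, a minimal sketch of such a default PodTemplate, assuming flytepropeller is pointed at it via the k8s plugin's default-pod-template-name setting; the template name, namespace, and placeholder image are illustrative:

apiVersion: v1
kind: PodTemplate
metadata:
  name: flyte-gpu-template
  namespace: flyte
template:
  spec:
    runtimeClassName: nvidia
    containers:
      - name: default                        # base container definition merged into task pods
        image: docker.io/rwgrim/docker-noop  # placeholder; the actual task image is used at runtime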
Ketan (kumare3):
Hmm, yeah, this is not good. Cc @Sören Brunk, @Felix Ruess, and @Alireza: we should file an issue for this. This definitely seems like something we can do in the higher-level container API to make things better.