# ask-the-community
Hi! Just ran into an issue when training a model with the HuggingFace Trainer on multiple GPUs inside a Flyte task. Pinned it down to the shared memory that's used by the HuggingFace Trainer running out (I think for us it had defaulted to about 60 MB). The solution we are thinking of is to mount more storage at /dev/shm, but we're finding it a bit tricky to figure out where to do that.
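For reference, here's a rough sketch of what we're considering: attaching a Memory-backed `emptyDir` at /dev/shm via flytekit's `PodTemplate` support. The task name, the `8Gi` size limit, and the container name `"primary"` are assumptions on our side (primary is flytekit's default primary container name, I believe), so treat this as untested:

```python
from flytekit import task, PodTemplate
from kubernetes.client import (
    V1PodSpec,
    V1Container,
    V1Volume,
    V1VolumeMount,
    V1EmptyDirVolumeSource,
)

# Pod template that mounts a Memory-backed emptyDir at /dev/shm so the
# HuggingFace Trainer's inter-process communication has more shared memory.
shm_pod_template = PodTemplate(
    pod_spec=V1PodSpec(
        containers=[
            V1Container(
                name="primary",  # assumed default primary container name
                volume_mounts=[
                    V1VolumeMount(name="dshm", mount_path="/dev/shm"),
                ],
            )
        ],
        volumes=[
            V1Volume(
                name="dshm",
                # Memory medium backs the volume with RAM; size_limit caps it
                # (and counts against the pod's memory usage) -- 8Gi is a guess.
                empty_dir=V1EmptyDirVolumeSource(medium="Memory", size_limit="8Gi"),
            )
        ],
    )
)


@task(pod_template=shm_pod_template)
def train_model() -> None:
    # hypothetical placeholder for our multi-GPU Trainer run
    ...
```

Does this look like the right place to do it, or is there a more idiomatic Flyte way (task resources, plugin config, etc.) to grow /dev/shm?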