# ask-the-community

Nan Qin

06/26/2023, 7:18 PM
has anyone tried using deepspeed with flyte? found this example from ray docs, wondering if we can use the kuberay integration

Niels Bantilan

06/26/2023, 7:19 PM
hi @Nan Qin this repo contains an example: https://github.com/unionai-oss/llm-fine-tuning
this task takes a `ds_config` argument and uses the PyTorch Elastic plugin and pod templates
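A minimal sketch of what such a task can look like, assuming the `flytekitplugins-kfpytorch` Elastic plugin; the task name, node counts, and body below are illustrative, not taken from the linked repo:

```python
from flytekit import task
from flytekitplugins.kfpytorch import Elastic


# Hypothetical sketch: the DeepSpeed config arrives as a plain task argument,
# while the Elastic task_config handles the torchrun-style multi-node launch.
@task(
    task_config=Elastic(
        nnodes=2,          # number of worker pods
        nproc_per_node=8,  # processes (GPUs) per pod
    ),
)
def train(ds_config: dict) -> str:
    # deepspeed is imported and initialized inside the task at runtime,
    # so it only needs to be installed in the task's container image.
    import deepspeed  # noqa: F401

    ...  # build model/engine from ds_config and run the training loop
    return "done"
```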

Fabio Grätz

06/27/2023, 7:11 AM
Ah interesting 🙂 Did using deepspeed with elastic pose any problems? Some time back, before using Flyte, I used deepspeed with kubeflow’s PyTorch Job and back then it worked without a problem. But I also saw that the project has evolved quite a bit since then, hence the question.

Niels Bantilan

06/27/2023, 2:34 PM
the main thing was to add a pod template with a volume mount for shared memory: https://github.com/unionai-oss/llm-fine-tuning/blob/main/fine_tuning/llm_fine_tuning.py#L346-L362
but other than the typical manual tuning to get past OOM-kill errors it worked fine 🙂
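A sketch of that kind of pod template, using flytekit's `PodTemplate` with an in-memory `emptyDir` mounted at `/dev/shm` (DeepSpeed/NCCL use shared memory for inter-process communication; the volume name here is illustrative):

```python
from flytekit import PodTemplate
from kubernetes.client import (
    V1Container,
    V1EmptyDirVolumeSource,
    V1PodSpec,
    V1Volume,
    V1VolumeMount,
)

# Mount a memory-backed emptyDir at /dev/shm so worker processes get more
# shared memory than the default 64Mi Kubernetes gives a container.
pod_template = PodTemplate(
    primary_container_name="primary",
    pod_spec=V1PodSpec(
        containers=[
            V1Container(
                name="primary",
                volume_mounts=[
                    V1VolumeMount(name="dshm", mount_path="/dev/shm"),
                ],
            )
        ],
        volumes=[
            V1Volume(
                name="dshm",
                empty_dir=V1EmptyDirVolumeSource(medium="Memory"),
            )
        ],
    ),
)
```

The template is then attached to the task, e.g. `@task(pod_template=pod_template, ...)`.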

Fabio Grätz

06/27/2023, 3:04 PM
All hail pod templates 👌
Best feature ^^

Nan Qin

06/27/2023, 3:09 PM
I am curious about how the hostnames and ssh access between hosts are configured. Does PytorchJob/Elastic do some magic behind the scenes?

Fabio Grätz

06/27/2023, 3:34 PM
Yes, the kubeflow training operator sets the host names etc. as env vars in the pods.
This is all abstracted away from the user and just works 🙂
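For example, a worker process can read the rendezvous settings the operator injects; these are the standard `torch.distributed` environment variable names, and the fallback defaults below are illustrative:

```python
import os

# The kubeflow training operator sets these env vars in each worker pod;
# torch.distributed picks them up when initializing the process group.
master_addr = os.environ.get("MASTER_ADDR", "localhost")
master_port = int(os.environ.get("MASTER_PORT", "29500"))
world_size = int(os.environ.get("WORLD_SIZE", "1"))
rank = int(os.environ.get("RANK", "0"))

print(f"rank {rank}/{world_size} rendezvous at {master_addr}:{master_port}")
```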