# ask-the-community
has anyone tried using deepspeed with flyte? found this example from ray docs, wondering if we can use the kuberay integration
hi @Nan Qin this repo contains an example: https://github.com/unionai-oss/llm-fine-tuning
this task uses a `ds_config` argument with the PyTorch Elastic plugin and pod templates
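To make the `ds_config` idea concrete, here is a hedged sketch: a minimal DeepSpeed config as a plain Python dict. The field values are illustrative, not what the linked repo uses, and the `effective_batch_size` helper is hypothetical; in a Flyte task the dict would be passed in as the `ds_config` argument and handed to the training framework (e.g. `transformers.TrainingArguments` accepts a `deepspeed=` dict).

```python
# Illustrative minimal DeepSpeed config as a plain dict -- values are
# placeholders, not tuned settings from the linked repo.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 2,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,  # ZeRO stage 3 also partitions the model parameters
        "offload_optimizer": {"device": "cpu"},
    },
}

# Hypothetical helper: global batch size implied by the config above.
def effective_batch_size(cfg: dict, world_size: int) -> int:
    """Global batch = micro-batch x grad-accum steps x number of GPUs."""
    return (
        cfg["train_micro_batch_size_per_gpu"]
        * cfg["gradient_accumulation_steps"]
        * world_size
    )
```

With 8 GPUs this config implies a global batch size of `4 * 2 * 8 = 64`.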
Ah interesting 🙂 Did using deepspeed with elastic pose any problems? Some time back, before using Flyte, I used deepspeed with kubeflow’s PyTorch Job and back then it worked without a problem. But I also saw that the project evolved quite a bit since then, hence the question.
the main thing was to add a pod template with a volume mount for shared memory: https://github.com/unionai-oss/llm-fine-tuning/blob/main/fine_tuning/llm_fine_tuning.py#L346-L362
but other than the typical manual tuning to resolve OOM-kill errors, it worked fine 🙂
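The pod template change amounts to mounting a memory-backed `emptyDir` at `/dev/shm` so the workers get enough shared memory. A sketch of that spec as a plain dict, under the assumption that the linked repo builds the equivalent with flytekit's `PodTemplate` and the kubernetes client objects (container name here is illustrative):

```python
# Sketch of the shared-memory pod spec as a plain dict; the linked repo
# expresses the same thing via flytekit PodTemplate + kubernetes V1PodSpec.
shm_pod_spec = {
    "containers": [
        {
            "name": "primary",  # assumed container name, illustrative only
            "volumeMounts": [{"name": "dshm", "mountPath": "/dev/shm"}],
        }
    ],
    # medium=Memory backs the emptyDir with tmpfs, i.e. actual shared memory
    "volumes": [{"name": "dshm", "emptyDir": {"medium": "Memory"}}],
}
```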
All hail pod templates 👌
Best feature ^^
I am curious about how the hostnames and SSH access between hosts are configured. Does PyTorchJob/Elastic do some magic behind the scenes?
Yes, the Kubeflow training operator sets the hostnames etc. as env vars in the pods.
This is all abstracted away from the user and just works 🙂
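Concretely, each worker pod can rely on the operator-injected rendezvous env vars, which `torch.distributed` reads directly; unlike DeepSpeed's standalone hostfile launcher, no inter-pod SSH is involved, since torchrun's rendezvous runs over TCP. A sketch of what the training code sees (the defaults below are only local-run fallbacks, not values the operator sets):

```python
import os

# Rendezvous info injected by the training operator into each worker pod.
# torch.distributed.init_process_group() consumes these env vars directly.
master_addr = os.environ.get("MASTER_ADDR", "localhost")
master_port = int(os.environ.get("MASTER_PORT", "29500"))
world_size = int(os.environ.get("WORLD_SIZE", "1"))
rank = int(os.environ.get("RANK", "0"))

print(f"rank {rank}/{world_size} -> rendezvous at {master_addr}:{master_port}")
```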