# flyte-support
s
has anyone tried using deepspeed with flyte? found this example from ray docs, wondering if we can use the kuberay integration
b
hi @shy-accountant-549 this repo contains an example: https://github.com/unionai-oss/llm-fine-tuning
this task uses a `ds_config` argument with the pytorch Elastic plugin and pod templates
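For context, a minimal sketch of what a `ds_config` argument might carry. The keys follow DeepSpeed's JSON config schema, but the specific values here are illustrative placeholders, not the linked repo's actual settings:

```python
# Illustrative DeepSpeed config dict, as might be passed to a Flyte task
# via a `ds_config` argument. Keys are from DeepSpeed's config schema;
# the values are placeholders, not the repo's settings.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 8,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,  # ZeRO stage 2: shard optimizer state and gradients
    },
}

# Inside the task body, a dict like this is typically handed to
# deepspeed.initialize(model=model, config=ds_config, ...)
```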
c
Ah interesting 🙂 Did using deepspeed with elastic pose any problems? Some time back, before using Flyte, I used deepspeed with kubeflow's PyTorchJob and back then it worked without a problem. But I also saw that the project has evolved quite a bit since then, hence the question.
b
the main thing was to add a pod template with a volume mount for shared memory: https://github.com/unionai-oss/llm-fine-tuning/blob/main/fine_tuning/llm_fine_tuning.py#L346-L362
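the gist of that mount, sketched as raw Kubernetes YAML rather than the repo's actual Python pod template (NCCL/DeepSpeed workers need more shared memory than the container default, so `/dev/shm` is backed by a memory-medium `emptyDir`):

```yaml
# Sketch of the shared-memory mount the linked pod template sets up.
# Container/volume names here are illustrative.
apiVersion: v1
kind: PodTemplate
template:
  spec:
    containers:
      - name: primary
        volumeMounts:
          - name: dshm
            mountPath: /dev/shm
    volumes:
      - name: dshm
        emptyDir:
          medium: Memory
```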
but other than the typical manual tuning to resolve OOMKilled errors it worked fine 🙂
c
All hail pod templates 👌
Best feature ^^
s
I am curious how the hostnames and SSH access between hosts are configured. Does PyTorchJob/Elastic do some magic behind the scenes?
c
Yes, the kubeflow training operator sets the host names etc. as env vars in the pods.
This is all abstracted away from the user and just works 🙂
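As an illustration, inside a worker pod the rendezvous settings surface as environment variables following torch distributed conventions. A sketch of reading them (the default values are only for running outside a cluster, and this is not the operator's complete variable set):

```python
import os


def rendezvous_info() -> dict:
    """Read the torch-distributed rendezvous settings that the kubeflow
    training operator injects into each worker pod.

    The defaults are purely illustrative fallbacks for running this
    outside a cluster; in a real PyTorchJob pod the operator sets them.
    """
    return {
        "master_addr": os.environ.get("MASTER_ADDR", "localhost"),
        "master_port": int(os.environ.get("MASTER_PORT", "29500")),
        "world_size": int(os.environ.get("WORLD_SIZE", "1")),
        "rank": int(os.environ.get("RANK", "0")),
    }
```

with `init_method="env://"`, `torch.distributed.init_process_group` picks these same variables up automatically, which is why it "just works" from the user's point of view.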