# ask-the-community

Nan Qin

06/26/2023, 7:18 PM
has anyone tried using deepspeed with flyte? found this example from ray docs, wondering if we can use the kuberay integration

Niels Bantilan

06/26/2023, 7:19 PM
hi @Nan Qin this repo contains an example: https://github.com/unionai-oss/llm-fine-tuning
this task takes a `ds_config` argument and uses the PyTorch Elastic plugin and pod templates
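A minimal sketch of what such a task can look like, assuming the `flytekitplugins-kfpytorch` Elastic plugin; the task name, node counts, and body below are illustrative, not taken from the linked repo:

```python
from flytekit import task
from flytekitplugins.kfpytorch import Elastic


# Hypothetical sketch: the DeepSpeed config arrives as a plain task argument,
# while the Elastic task_config handles the torchrun-style multi-node launch.
@task(
    task_config=Elastic(
        nnodes=2,          # number of worker pods
        nproc_per_node=8,  # processes (GPUs) per pod
    ),
)
def train(ds_config: dict) -> str:
    # deepspeed is imported and initialized inside the task at runtime,
    # so it only needs to be installed in the task's container image.
    import deepspeed  # noqa: F401

    ...  # build model/engine from ds_config and run the training loop
    return "done"
```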

Fabio Grätz

06/27/2023, 7:11 AM
Ah interesting 🙂 Did using deepspeed with elastic pose any problems? Some time back, before using Flyte, I used deepspeed with kubeflow’s PyTorch Job and back then it worked without a problem. But I also saw that the project has evolved quite a bit since then, hence the question.

Niels Bantilan

06/27/2023, 2:34 PM
the main thing was to add a pod template with a volume mount for shared memory: https://github.com/unionai-oss/llm-fine-tuning/blob/main/fine_tuning/llm_fine_tuning.py#L346-L362
but other than the typical manual tuning to get past OOM-kill errors it worked fine 🙂
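A sketch of that kind of pod template, using flytekit's `PodTemplate` with an in-memory `emptyDir` mounted at `/dev/shm` (DeepSpeed/NCCL use shared memory for inter-process communication; the volume name here is illustrative):

```python
from flytekit import PodTemplate
from kubernetes.client import (
    V1Container,
    V1EmptyDirVolumeSource,
    V1PodSpec,
    V1Volume,
    V1VolumeMount,
)

# Mount a memory-backed emptyDir at /dev/shm so worker processes get more
# shared memory than the default 64Mi Kubernetes gives a container.
pod_template = PodTemplate(
    primary_container_name="primary",
    pod_spec=V1PodSpec(
        containers=[
            V1Container(
                name="primary",
                volume_mounts=[
                    V1VolumeMount(name="dshm", mount_path="/dev/shm"),
                ],
            )
        ],
        volumes=[
            V1Volume(
                name="dshm",
                empty_dir=V1EmptyDirVolumeSource(medium="Memory"),
            )
        ],
    ),
)
```

The template is then attached to the task, e.g. `@task(pod_template=pod_template, ...)`.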

Fabio Grätz

06/27/2023, 3:04 PM
All hail pod templates 👌
Best feature ^^

Nan Qin

06/27/2023, 3:09 PM
I am curious about how the hostnames and ssh access between hosts are configured. Does PytorchJob/Elastic do some magic behind the scenes?

Fabio Grätz

06/27/2023, 3:34 PM
Yes, the kubeflow training operator sets the host names etc. as env vars in the pods.
This is all abstracted away from the user and just works 🙂
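For example, a worker process can read the rendezvous settings the operator injects; these are the standard `torch.distributed` environment variable names, and the fallback defaults below are illustrative:

```python
import os

# The kubeflow training operator sets these env vars in each worker pod;
# torch.distributed picks them up when initializing the process group.
master_addr = os.environ.get("MASTER_ADDR", "localhost")
master_port = int(os.environ.get("MASTER_PORT", "29500"))
world_size = int(os.environ.get("WORLD_SIZE", "1"))
rank = int(os.environ.get("RANK", "0"))

print(f"rank {rank}/{world_size} rendezvous at {master_addr}:{master_port}")
```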