Flyte enables production-grade orchestration for machine learning workflows and data processing, and is built to accelerate the move from local workflows to production.


Hi team, I need some help understanding how replicas work with PyTorch/MPI/etc. in Flyte. I was looking at the outputDir code, and it looks like all the replicas (shards, really) upload their data to the same path, which would mean we are potentially losing data we output. I am trying to check whether there is a piece of code that merges this data, but I don't see any synchronization code. Would anyone be able to give me some pointers on that?

What do you mean by replicas? Can you share a snippet of your PyTorch code?

Actually, it is fine right now. With multi-node PyTorch, we had multiple workers, and each worker had sharded data. Since this was done by the operators instead of by Flyte directly, the pod names were of the form execid-n0-0-worker-x. This caused the path to be created as execid/n0/0/outputs.pb for all workers. I was later told that NCCL did all the communication for the workers, and hence outputs.pb wasn't required. So this is a non-issue for now. Thanks for following up! I don't have the PyTorch code myself. I'll share it when I get it.
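
For anyone reading this later, here is a rough sketch of the collision described above. It is not Flyte's actual outputDir code: `output_path_from_pod_name` is a hypothetical helper, and it assumes the path comes from the first three dash-separated parts of the pod name (execution id, node id, attempt), with the worker suffix dropped. It also assumes the execution id itself contains no dashes.

```python
def output_path_from_pod_name(pod_name: str) -> str:
    """Hypothetical helper: build an output path from a pod name.

    Assumes pod names look like 'execid-n0-0-worker-x' and that only the
    execution id, node id, and attempt number end up in the path.
    """
    exec_id, node_id, attempt = pod_name.split("-")[:3]
    return f"{exec_id}/{node_id}/{attempt}/outputs.pb"


# Four workers created by the operator for the same task attempt.
pod_names = [f"execid-n0-0-worker-{i}" for i in range(4)]

# Map each worker pod to the path it would write to.
paths = {name: output_path_from_pod_name(name) for name in pod_names}

# Every worker resolves to the same path, so writes would overwrite each other.
unique_paths = set(paths.values())
print(unique_paths)
```

Under those assumptions, all four workers map to a single outputs.pb, which is why separate per-worker outputs would be lost unless something merges them or only one worker's output matters.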