# flyte-support
Are there users that run multi-node distributed training on Flyte in a fault-tolerant way (e.g., by using intratask checkpoints with elastic torch)? I spoke to some scientists who do large-scale distributed training, and all they want is a way to spin up spot instances and have training keep going even when some GPUs go down.
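Roughly what that pattern could look like with flytekit's intratask checkpoint API (`current_context().checkpoint`). This is only a sketch: the training loop and the saved state are placeholders, and for the elastic torch part you would layer the `flytekitplugins-kfpytorch` `Elastic` task config on top.

```python
# Sketch: intratask checkpointing so a retried attempt (e.g. after spot
# preemption) resumes instead of restarting from scratch.
import io

import torch
from flytekit import current_context, task


@task(retries=3)  # retried attempts can land on fresh (spot) instances
def train(epochs: int) -> int:
    cp = current_context().checkpoint
    start_epoch = 0

    # Resume from a checkpoint written by a previous (preempted) attempt, if any.
    prev = cp.read()
    if prev is not None:
        state = torch.load(io.BytesIO(prev))
        start_epoch = state["epoch"] + 1
        # model.load_state_dict(state["model"]); optimizer.load_state_dict(...)

    for epoch in range(start_epoch, epochs):
        # ... one epoch of (distributed) training goes here ...
        buf = io.BytesIO()
        torch.save({"epoch": epoch}, buf)  # add model/optimizer state in practice
        cp.write(buf.getvalue())  # Flyte persists this for the next attempt

    return epochs
```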
@polite-ability-4005 and folks from LinkedIn run distributed training, but they have custom checkpointing logic. I have talked to them about potentially collaborating on this, and there was some interest. Byron, can you remind me who it was from your team? It also seems PyTorch now has better native support for checkpointing: https://pytorch.org/tutorials/recipes/distributed_checkpoint_recipe.html. We can store the state dict and the dataloader state.
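A rough sketch of what the linked recipe boils down to, using `torch.distributed.checkpoint` (DCP) with its `Stateful` / `get_state_dict` helpers. The checkpoint directory and the `AppState` wrapper are illustrative names, not anything Flyte-specific.

```python
# Sketch: saving/loading a distributed checkpoint with torch.distributed.checkpoint.
import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint.state_dict import get_state_dict, set_state_dict
from torch.distributed.checkpoint.stateful import Stateful

CHECKPOINT_DIR = "/tmp/dcp_checkpoint"  # illustrative path


class AppState(Stateful):
    """Wraps model + optimizer so DCP can save/load their (possibly sharded) state dicts."""

    def __init__(self, model, optimizer):
        self.model = model
        self.optimizer = optimizer

    def state_dict(self):
        # get_state_dict handles DDP/FSDP-specific flattening for us.
        model_sd, optim_sd = get_state_dict(self.model, self.optimizer)
        # A dataloader's state_dict (e.g. torchdata's StatefulDataLoader) could be
        # added to this dict as well.
        return {"model": model_sd, "optim": optim_sd}

    def load_state_dict(self, state_dict):
        set_state_dict(
            self.model,
            self.optimizer,
            model_state_dict=state_dict["model"],
            optim_state_dict=state_dict["optim"],
        )


def save_checkpoint(model, optimizer, step: int) -> None:
    state = {"app": AppState(model, optimizer)}
    dcp.save(state, checkpoint_id=f"{CHECKPOINT_DIR}/step_{step}")


def load_checkpoint(model, optimizer, step: int) -> None:
    state = {"app": AppState(model, optimizer)}
    dcp.load(state, checkpoint_id=f"{CHECKPOINT_DIR}/step_{step}")
```

`dcp.save`/`dcp.load` are collective calls, so every rank invokes them and each rank reads back its own shards on resume.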