# flyte-support
Are there users that run multi-node distributed training on Flyte in a fault-tolerant way (e.g., by using intratask checkpoints with elastic torch)? I spoke to some scientists who do large-scale distributed training, and all they want is a way to spin up spot instances and have training keep going even when some GPUs go down.
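Roughly what that pattern could look like with flytekit's intratask checkpoint API (`current_context().checkpoint`). This is only a sketch: the training loop and the saved state are placeholders, and for the elastic torch part you would layer the `flytekitplugins-kfpytorch` `Elastic` task config on top.

```python
# Sketch: intratask checkpointing so a retried attempt (e.g. after spot
# preemption) resumes instead of restarting from scratch.
import io

import torch
from flytekit import current_context, task


@task(retries=3)  # retried attempts can land on fresh (spot) instances
def train(epochs: int) -> int:
    cp = current_context().checkpoint
    start_epoch = 0

    # Resume from a checkpoint written by a previous (preempted) attempt, if any.
    prev = cp.read()
    if prev is not None:
        state = torch.load(io.BytesIO(prev))
        start_epoch = state["epoch"] + 1
        # model.load_state_dict(state["model"]); optimizer.load_state_dict(...)

    for epoch in range(start_epoch, epochs):
        # ... one epoch of (distributed) training goes here ...
        buf = io.BytesIO()
        torch.save({"epoch": epoch}, buf)  # add model/optimizer state in practice
        cp.write(buf.getvalue())  # Flyte persists this for the next attempt

    return epochs
```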
@polite-ability-4005 and folks from LinkedIn run distributed training, but they have custom checkpointing logic. I have talked to them about potentially collaborating on this, and there was some interest. Byron, can you remind me who it was from your team? It also seems PyTorch now has better native support for checkpointing: https://pytorch.org/tutorials/recipes/distributed_checkpoint_recipe.html. We can store the state dict and the dataloader state.
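A rough sketch of what the linked recipe boils down to, using `torch.distributed.checkpoint` (DCP) with its `Stateful` / `get_state_dict` helpers. The checkpoint directory and the `AppState` wrapper are illustrative names, not anything Flyte-specific.

```python
# Sketch: saving/loading a distributed checkpoint with torch.distributed.checkpoint.
import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint.state_dict import get_state_dict, set_state_dict
from torch.distributed.checkpoint.stateful import Stateful

CHECKPOINT_DIR = "/tmp/dcp_checkpoint"  # illustrative path


class AppState(Stateful):
    """Wraps model + optimizer so DCP can save/load their (possibly sharded) state dicts."""

    def __init__(self, model, optimizer):
        self.model = model
        self.optimizer = optimizer

    def state_dict(self):
        # get_state_dict handles DDP/FSDP-specific flattening for us.
        model_sd, optim_sd = get_state_dict(self.model, self.optimizer)
        # A dataloader's state_dict (e.g. torchdata's StatefulDataLoader) could be
        # added to this dict as well.
        return {"model": model_sd, "optim": optim_sd}

    def load_state_dict(self, state_dict):
        set_state_dict(
            self.model,
            self.optimizer,
            model_state_dict=state_dict["model"],
            optim_state_dict=state_dict["optim"],
        )


def save_checkpoint(model, optimizer, step: int) -> None:
    state = {"app": AppState(model, optimizer)}
    dcp.save(state, checkpoint_id=f"{CHECKPOINT_DIR}/step_{step}")


def load_checkpoint(model, optimizer, step: int) -> None:
    state = {"app": AppState(model, optimizer)}
    dcp.load(state, checkpoint_id=f"{CHECKPOINT_DIR}/step_{step}")
```

`dcp.save`/`dcp.load` are collective calls, so every rank invokes them and each rank reads back its own shards on resume.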