@polite-ability-4005 and folks from LinkedIn run distributed training but have custom checkpointing logic. I've talked to them about potentially collaborating on this; there was some interest. Byron, can you remind me who it was from your team?
it also seems pytorch now has better native support for distributed checkpointing:
https://pytorch.org/tutorials/recipes/distributed_checkpoint_recipe.html we can store the model/optimizer state dicts and the dataloader state
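for reference, a minimal sketch of what the DCP flow from that recipe looks like. assumes a recent pytorch (2.3+) and a dataloader that exposes `state_dict()`/`load_state_dict()` (e.g. torchdata's `StatefulDataLoader`; the plain `DataLoader` doesn't). `CHECKPOINT_DIR` is a placeholder path:

```python
# Sketch of torch.distributed.checkpoint (DCP) save/load, per the linked recipe.
# Dataloader state handling is an assumption: it only works if the loader
# implements state_dict()/load_state_dict() itself.
import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint.state_dict import get_state_dict, set_state_dict
from torch.distributed.checkpoint.stateful import Stateful

CHECKPOINT_DIR = "checkpoint"  # placeholder


class AppState(Stateful):
    """Wraps model/optimizer (and optionally the dataloader) so DCP can
    drive state_dict()/load_state_dict() on a single object."""

    def __init__(self, model, optimizer, dataloader=None):
        self.model = model
        self.optimizer = optimizer
        self.dataloader = dataloader

    def state_dict(self):
        # get_state_dict unshards/handles FSDP- or DDP-wrapped params correctly
        model_sd, optim_sd = get_state_dict(self.model, self.optimizer)
        sd = {"model": model_sd, "optim": optim_sd}
        if self.dataloader is not None:
            sd["dataloader"] = self.dataloader.state_dict()
        return sd

    def load_state_dict(self, state_dict):
        set_state_dict(
            self.model,
            self.optimizer,
            model_state_dict=state_dict["model"],
            optim_state_dict=state_dict["optim"],
        )
        if self.dataloader is not None and "dataloader" in state_dict:
            self.dataloader.load_state_dict(state_dict["dataloader"])


# save (collective op, call on every rank):
#   dcp.save({"app": AppState(model, optimizer, dataloader)}, checkpoint_id=CHECKPOINT_DIR)
#
# load (each rank reads only the shards it owns):
#   state = {"app": AppState(model, optimizer, dataloader)}
#   dcp.load(state, checkpoint_id=CHECKPOINT_DIR)
```

nice property vs. custom logic: saves are sharded per rank and the same checkpoint can be reloaded under a different world size / parallelism layout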