Does running the Flyte control plane on spot instances raise any concerns? Is anyone running their control plane on spot instances, and has it caused any difficulties?
10/26/2022, 8:12 PM
and potential liveness / progress concerns. But not too bad. Preferably put it on reserved machines
10/26/2022, 8:24 PM
10/26/2022, 8:27 PM
We are, and we haven't had many issues. I've definitely been considering moving the control plane off spot instances and keeping only the workloads on spot. Long term, I think the control plane should not be on spot instances, but we're trying to save some $$ 😅
10/26/2022, 8:41 PM
It’s probably also a little more code to use intratask checkpoints in the case of, e.g., long-running model training.
No one asked, but here’s a more real-world ML example of using this feature: link 🙃
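For anyone skimming the thread, here's a rough sketch of the checkpoint-and-resume pattern that intratask checkpoints are about. It is plain Python, not Flyte code: `train`, `run_with_retries`, `Preempted`, and the JSON checkpoint file are all made up for illustration, and the retry loop only simulates a task being rescheduled after a spot preemption. In a real Flyte task the read/write of saved state would presumably go through flytekit's checkpoint object rather than a local file.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


class Preempted(Exception):
    """Hypothetical stand-in for a spot instance being reclaimed mid-task."""


def train(total_steps: int, ckpt: Path, fail_at: Optional[int] = None) -> int:
    # Resume from the last saved state if a checkpoint exists, else start fresh.
    if ckpt.exists():
        state = json.loads(ckpt.read_text())
    else:
        state = {"step": 0, "loss_sum": 0}

    for step in range(state["step"], total_steps):
        if fail_at is not None and step == fail_at:
            raise Preempted(f"preempted at step {step}")
        state["loss_sum"] += step      # stand-in for one unit of training work
        state["step"] = step + 1
        ckpt.write_text(json.dumps(state))  # persist progress after each step
    return state["loss_sum"]


def run_with_retries(total_steps: int, ckpt: Path, fail_at: int, retries: int = 3) -> int:
    # Simulates the scheduler retrying the task; only the first attempt is preempted.
    for attempt in range(retries + 1):
        try:
            return train(total_steps, ckpt, fail_at if attempt == 0 else None)
        except Preempted:
            continue
    raise RuntimeError("out of retries")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(run_with_retries(10, Path(tmp) / "ckpt.json", fail_at=5))
```

The extra code is mostly the load-state and save-state lines. The retry picks up at step 5 instead of redoing steps 0 through 4, which is what makes spot preemptions tolerable for long training runs.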