Hello, does anyone know what the expected behaviour is when Flyte's control plane restarts while a workflow is running that uses dynamic workflows and waits on external inputs or timers, for example via a sleep node?
Can it gracefully survive the control plane restart, resume traversing the graph (for the dynamic workflow), and continue sleeping until the expected duration has elapsed (for the sleep node)?
07/20/2023, 7:49 AM
Yes, it should. FlytePropeller maintains state in etcd, so it should survive even DB restarts. Propeller is also designed to be stateless, with each discrete step stored and fully recoverable, so even Propeller being down for a while should not impact anything.
The system is designed to be resilient to outages and failures in single
In-progress execution state is lost only if Kubernetes itself crashes, and even then the state of completed steps can be fully recovered.
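To illustrate the idea for the sleep node: below is a minimal sketch (not Flyte's actual implementation) of how a stateless controller can resume a sleep after a restart, assuming the node's start time is persisted in a durable store. `Store`, `SleepNodeState`, and `evaluate_sleep_node` are hypothetical names made up for this example; the in-memory dict only stands in for something durable like etcd.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class SleepNodeState:
    """Hypothetical persisted record for one sleep node."""
    started_at: float  # wall-clock time (seconds) when the sleep began
    duration: float    # requested sleep duration in seconds


class Store:
    """Stand-in for a durable store; a real system would not keep this in memory."""

    def __init__(self) -> None:
        self._data: Dict[str, SleepNodeState] = {}

    def get(self, node_id: str) -> Optional[SleepNodeState]:
        return self._data.get(node_id)

    def put(self, node_id: str, state: SleepNodeState) -> None:
        self._data[node_id] = state


def evaluate_sleep_node(store: Store, node_id: str, duration: float, now: float) -> float:
    """One stateless evaluation step.

    Returns the remaining sleep time in seconds; 0.0 means the node is done.
    The controller keeps nothing in memory between calls, so a restarted
    controller calling this again picks up where the previous one left off.
    """
    state = store.get(node_id)
    if state is None:
        # First visit: record the start time before doing anything else.
        state = SleepNodeState(started_at=now, duration=duration)
        store.put(node_id, state)
    # Remaining time is derived from persisted state, not from an in-process timer.
    return max(0.0, state.started_at + state.duration - now)


store = Store()
first = evaluate_sleep_node(store, "sleep-1", duration=60.0, now=100.0)
# Controller "restarts" here; only the store survives.
after_restart = evaluate_sleep_node(store, "sleep-1", duration=60.0, now=130.0)
finished = evaluate_sleep_node(store, "sleep-1", duration=60.0, now=170.0)
print(first, after_restart, finished)
```

Because the remaining time is computed from the stored start time, the downtime counts towards the sleep rather than restarting it from zero.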