Flyte implements resiliency through several mechanisms:
1. State Management: FlytePropeller persists workflow state in etcd, so workflows can survive control plane restarts. FlytePropeller itself is designed to be stateless: each discrete step is stored and fully recoverable (source).
2. Retries: Flyte supports both user and system retries. User retries can be configured for tasks that fail with recoverable errors, while system retries handle failures caused by downstream system issues or other transient faults (source).
3. Timeouts: Tasks can be configured with timeouts so they do not run indefinitely. A task that exceeds its specified runtime is marked as failed, which lets the system keep making progress (source).
4. Caching/Memoization: Flyte can cache task outputs to avoid re-running tasks unnecessarily, which helps when recovering from failures because completed computations do not have to be redone (source).
Would you like more details on any specific mechanism?
Sources:
- https://discuss.flyte.org/t/13252961/hello-does-anyone-know-what-s-the-expected-behaviour-when-fl/
- https://flyte.org/blog/build-indestructible-pipelines-with-flyte/