<@U06RKAMK0UC> do tasks and workflows survive rest...
# ask-ai
a
@RunLLM do tasks and workflows survive restarts?
r
Yes, Flyte tasks and workflows are designed to survive restarts. FlytePropeller maintains state in etcd, so workflows can continue from where they left off even if the control plane or database restarts. FlytePropeller itself is stateless; discrete steps are stored and fully recoverable, which makes the system resilient to outages and failures in single components (source). Would you like more details on how Flyte handles restarts and state recovery?
Sources:
- https://discuss.flyte.org/t/13252961/hello-does-anyone-know-what-s-the-expected-behaviour-when-fl/
a
@RunLLM can you explain how flyte implements resiliency?
r
Flyte implements resiliency through several mechanisms:
1. State Management: FlytePropeller maintains state in etcd, so workflows can survive control plane restarts. The system is designed to be stateless, with discrete steps stored and fully recoverable (source).
2. Retries: Flyte supports both user and system retries. User retries can be configured for tasks that fail due to recoverable errors, while system retries handle failures caused by downstream system issues or other transient failures (source).
3. Timeouts: Tasks can be configured with timeouts so they do not run indefinitely. A task that exceeds its specified runtime is marked as failed, which lets the system keep making progress (source).
4. Caching/Memoization: Flyte supports caching of task outputs to avoid re-running tasks unnecessarily, which helps in recovering from failures without redoing all computations (source).
Would you like more details on any specific mechanism?
Sources:
- https://discuss.flyte.org/t/13252961/hello-does-anyone-know-what-s-the-expected-behaviour-when-fl/
- https://flyte.org/blog/build-indestructible-pipelines-with-flyte/
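To make the retry and caching ideas above concrete, here is a minimal plain-Python sketch. It is a toy model, not Flyte's implementation and not the flytekit API: `run_task`, `RecoverableError`, and the in-memory `_cache` are hypothetical names invented for illustration, and timeouts are left out for brevity.

```python
from typing import Any, Callable, Dict, Optional, Tuple

# Hypothetical in-memory cache keyed by (task name, cache version, inputs).
# This only illustrates memoization; it says nothing about where Flyte stores outputs.
_cache: Dict[Tuple[str, Optional[str], Tuple[Any, ...]], Any] = {}


class RecoverableError(Exception):
    """Stand-in for a failure that is worth retrying (assumption for this sketch)."""


def run_task(
    fn: Callable[..., Any],
    args: Tuple[Any, ...],
    *,
    retries: int = 0,
    cache_version: Optional[str] = None,
) -> Any:
    """Run fn(*args), retrying on RecoverableError and memoizing the output."""
    key = (fn.__name__, cache_version, args)
    if cache_version is not None and key in _cache:
        return _cache[key]  # cache hit: the task body is not executed again

    last_error: Optional[Exception] = None
    for _attempt in range(retries + 1):  # one initial attempt plus `retries` retries
        try:
            result = fn(*args)
            break
        except RecoverableError as err:
            last_error = err
    else:
        # Every attempt failed: surface the last recoverable error.
        raise last_error

    if cache_version is not None:
        _cache[key] = result
    return result


calls = {"count": 0}


def flaky_double(x: int) -> int:
    """Fails twice with a transient error, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise RecoverableError("transient failure")
    return 2 * x


first = run_task(flaky_double, (21,), retries=3, cache_version="v1")
second = run_task(flaky_double, (21,), retries=3, cache_version="v1")
```

The first call succeeds on its third attempt; the second call is served from the cache, so `flaky_double` is not invoked again.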
k
Also supports recover semantics
a
I don't know what that means
k
So Flyte can replay failed runs. This is done from the metadata store (Postgres) and S3 combination. You can use an API to recover: all succeeded nodes in the graph will be skipped and it will resume from the last failed node
Check the recover button on a failed workflow
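The recover behaviour described above (skip nodes that already succeeded, resume at the failed node) can be sketched in a few lines of plain Python. This is a toy, in-memory model over a linear graph, not Flyte's recover API: `execute`, the node functions, and the `state` flag are hypothetical names made up for illustration, and the `recovered` dict merely stands in for the Postgres and S3 combination mentioned in the thread.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

# A node is a (name, function) pair; the function receives the outputs produced so far.
Node = Tuple[str, Callable[[Dict[str, Any]], Any]]


def execute(
    nodes: List[Node],
    recovered: Optional[Dict[str, Any]] = None,
) -> Tuple[Dict[str, Any], Optional[str]]:
    """Run nodes in order and return (outputs, name of the failed node or None).

    `recovered` holds outputs of nodes that succeeded in an earlier run.
    Those nodes are skipped and their stored outputs are reused.
    """
    outputs: Dict[str, Any] = dict(recovered or {})
    for name, fn in nodes:
        if name in outputs:
            continue  # succeeded in the earlier run: skip it
        try:
            outputs[name] = fn(outputs)
        except Exception:
            return outputs, name  # stop at the first failure, keep what succeeded
    return outputs, None


runs: List[str] = []              # records which node bodies actually executed
state = {"storage_up": False}     # simulated outage that is fixed before recovery


def extract(_: Dict[str, Any]) -> List[int]:
    runs.append("extract")
    return [1, 2, 3]


def transform(out: Dict[str, Any]) -> List[int]:
    runs.append("transform")
    if not state["storage_up"]:
        raise RuntimeError("storage unavailable")
    return [x * 10 for x in out["extract"]]


def load(out: Dict[str, Any]) -> int:
    runs.append("load")
    return sum(out["transform"])


nodes: List[Node] = [("extract", extract), ("transform", transform), ("load", load)]

first_outputs, first_failed = execute(nodes)   # fails at "transform"
state["storage_up"] = True                     # the outage is resolved
final_outputs, final_failed = execute(nodes, recovered=first_outputs)
```

In the recovered run `extract` is not executed again; only `transform` and `load` run.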
a
Some of the links I've found suggest etcd is in the picture for state, but I'm not sure that's true. I haven't seen it in the deployment models. I see Postgres and object stores (RDS, S3), as we're on AWS
k
Yes etcd is the state store
We built it on k8s because k8s ships with etcd