@ancient-wolf-19325 do tasks and workflows survive restarts? (Flyte #ask-ai)


# ask-ai

bumpy-match-83743

06/28/2024, 2:29 PM

@ancient-wolf-19325 do tasks and workflows survive restarts?

ancient-wolf-19325

06/28/2024, 2:29 PM

Yes, Flyte tasks and workflows are designed to survive restarts. FlytePropeller persists workflow state in etcd, so even if the control plane or database restarts, workflows can continue from where they left off. FlytePropeller itself is stateless: discrete steps are stored and fully recoverable, which makes the system resilient to outages and failures in single components (source). Would you like more details on how Flyte handles restarts and state recovery?

Sources:
- https://discuss.flyte.org/t/13252961/hello-does-anyone-know-what-s-the-expected-behaviour-when-fl/

bumpy-match-83743

06/28/2024, 2:32 PM

@ancient-wolf-19325 can you explain how flyte implements resiliency?

ancient-wolf-19325

06/28/2024, 2:32 PM

Flyte implements resiliency through several mechanisms:

1. State Management: FlytePropeller persists workflow state in etcd, so workflows can survive control plane restarts. FlytePropeller itself is designed to be stateless, with discrete steps stored and fully recoverable (source).
2. Retries: Flyte supports both user and system retries. User retries can be configured for tasks that fail due to recoverable errors, while system retries handle failures caused by downstream system issues or transient failures (source).
3. Timeouts: Tasks can be configured with timeouts so that they do not run indefinitely. A task that exceeds its specified runtime is marked as failed, which lets the system keep making progress (source).
4. Caching/Memoization: Flyte supports caching of task outputs to avoid re-running tasks unnecessarily, which helps in recovering from failures without redoing all computations (source).

Would you like more details on any specific mechanism?

Sources:
- https://discuss.flyte.org/t/13252961/hello-does-anyone-know-what-s-the-expected-behaviour-when-fl/
- https://flyte.org/blog/build-indestructible-pipelines-with-flyte/

freezing-airport-6809

06/28/2024, 2:38 PM

Also supports recover semantics

bumpy-match-83743

06/28/2024, 2:45 PM

I don't know what that means

freezing-airport-6809

06/28/2024, 2:50 PM

So Flyte can replay failed runs. This is done from the combination of the metadata store (Postgres) and S3. You can use an API to recover: all succeeded nodes in the graph will be skipped, and it will resume from the last failed node.
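The idea behind recover semantics can be sketched in plain Python. This is a conceptual toy, not Flyte's implementation or API: the `run` function, the node names, and the in-memory `recovered` dictionary are all hypothetical stand-ins for what Flyte reads from Postgres and S3.

```python
from typing import Callable, Dict, List, Optional, Tuple

Node = Callable[[Dict[str, int]], int]


def run(
    order: List[str],
    nodes: Dict[str, Node],
    recovered: Optional[Dict[str, int]] = None,
) -> Tuple[Dict[str, int], List[str], Optional[str]]:
    """Run nodes in order, reusing outputs recorded by an earlier run."""
    outputs = dict(recovered or {})
    executed: List[str] = []
    for name in order:
        if name in outputs:
            continue  # node succeeded in the earlier run: skip it
        try:
            outputs[name] = nodes[name](outputs)
        except Exception:
            return outputs, executed, name  # stop at the first failed node
        executed.append(name)
    return outputs, executed, None


flaky = {"fail": True}  # simulates a transient failure on the first run


def transform(outputs: Dict[str, int]) -> int:
    if flaky["fail"]:
        raise RuntimeError("transient failure")
    return outputs["extract"] * 2


nodes: Dict[str, Node] = {
    "extract": lambda outputs: 21,
    "transform": transform,
    "load": lambda outputs: outputs["transform"],
}
order = ["extract", "transform", "load"]

# First run fails at "transform"; only "extract" has a recorded output.
first_outputs, first_executed, failed_node = run(order, nodes)

# Recovery: reuse the recorded outputs, so "extract" is not executed again.
flaky["fail"] = False
final_outputs, second_executed, _ = run(order, nodes, recovered=first_outputs)
```

In the second call only `transform` and `load` execute, which mirrors the behaviour described above: succeeded nodes are skipped and execution resumes at the failed node.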

freezing-airport-6809

06/28/2024, 2:51 PM

Check the recover button on a failed workflow

bumpy-match-83743

06/28/2024, 3:09 PM

Some of the links I've found suggest etcd is in the picture for state, but I'm not sure that's true. I haven't seen it in the deployment models. I see Postgres and object stores (RDS, S3), as we're on AWS.

freezing-airport-6809

06/28/2024, 3:48 PM

Yes, etcd is the state store

freezing-airport-6809

06/28/2024, 3:48 PM

We built it on k8s because k8s ships with etcd

👍 1
