# flyte-support
g
Could anyone comment on the queue implementation in Flyte? Are those queues durable? What happens to the queue state in case of flyte-binary unexpected restart? Are there any settings we can change with respect to queue durability? CC @square-carpet-13590 @clever-exabyte-82294
👍 1
r
I can't speak to the queue implementation, but one thing that helped our workflows survive restarts was this setting
🙏 1
g
Thanks for this, @ripe-smartphone-56353
@clever-exabyte-82294, should we enable this setting too? It's unrelated to queue durability, but it looks useful for us as well.
@ripe-smartphone-56353 by the way, did it add any overhead time for your workflows?
r
by the way, did it add any overhead time for your workflows?
I don't know.
g
I have checked the source code; it looks like Flyte queues are backed by the `workqueue` implementation, which is a non-durable in-memory queue AFAIK. Here is the starting point for the queue implementation: https://github.com/flyteorg/flyte/blob/bed761c33c40af23750467c828afea553c0b80a0/flytepropeller/pkg/controller/workqueue.go#L7
It looks like underneath it uses the Kubernetes implementation of queues. I'm not entirely sure, but I see interfaces for queues everywhere and no internal implementations.
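For anyone curious, here is a minimal sketch of the client-go `workqueue` pattern that the linked file builds on. This is illustrative code, not Flyte's, and the key string is made up; it just shows that the queue lives purely in process memory:

```go
package main

import (
	"fmt"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	// The controller work queue lives entirely in process memory:
	// nothing here is written to disk or etcd.
	q := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())

	// Items are typically namespace/name keys of CRs, not the CRs themselves.
	q.Add("flytesnacks-development/f6a0b1c2d3")

	// A worker goroutine pops and processes keys like this.
	key, shutdown := q.Get()
	if !shutdown {
		fmt.Println("processing", key)
		q.Done(key) // mark the item as finished so it can be re-queued later
	}

	// If the process restarts, the queue contents are gone; any durability
	// has to come from re-listing the CRs, not from the queue itself.
	q.ShutDown()
}
```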
h
@glamorous-rainbow-77959 when referring to queues, IIUC you're inquiring about the durability of workflow executions in the case of failures? Flyte achieves this by building FlytePropeller as a k8s operator. When a workflow is executed we create a k8s CR (FlyteWorkflow). Using an informer cache, FlytePropeller operates over this FlyteWorkflow CR, updating the execution status as progress is made. These updates are designed to be singular and idempotent; for example, if we create a k8s Pod to execute a task, we update the CR to record that the pod was created before moving to the next operation (e.g. checking the status of the pod). This way, if there is a failure in storing state (or a complete failure of FlytePropeller), we reattempt the operation (in this case creating the pod) and see it already exists before attempting to store state again.

So, in a long-winded way, answering your question about Flyte's queue implementation: the internal `queue` is basically a list of FlyteWorkflow CR IDs that FlytePropeller knows to process. If FlytePropeller dies and comes back up, this is repopulated from the list of non-terminal FlyteWorkflow CRs. All state for these executions is durably stored in etcd as part of k8s apiserver ops.

@average-finland-92144 not sure if we have docs highlighting the durability guarantees? It may be worth including this information in a FlytePropeller series?
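To make the idempotency point above concrete, here is a simplified sketch of the "create pod" step, assuming client-go's typed clientset. The function name and structure are illustrative, not FlytePropeller's actual code:

```go
// Simplified sketch of an idempotent "create pod" step: re-running it after a
// crash is safe, because an AlreadyExists error just means a previous attempt
// got as far as creating the pod before the CR status was stored.
package example

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// ensureTaskPod creates the pod for a task node if it does not exist yet.
func ensureTaskPod(ctx context.Context, kube kubernetes.Interface, pod *corev1.Pod) error {
	_, err := kube.CoreV1().Pods(pod.Namespace).Create(ctx, pod, metav1.CreateOptions{})
	if apierrors.IsAlreadyExists(err) {
		// The pod was created on an earlier attempt; treat this as success and
		// let the caller record "pod created" in the FlyteWorkflow CR status.
		return nil
	}
	return err
}
```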
a
not sure if we have docs highlighting the durability guarantees?
I don't think this is explicitly covered in the docs
It may be worth including this information in a FlytePropeller series?
Absolutely
🙏 1
f
It is durable across restarts, or even if the entire system is down for hours.
It is not durable across k8s loss, but you can recover any execution as long as the DB is around. If the DB is lost then all durability is lost (it's possible to recover from the blob store, but that's pretty hard).
g
@hallowed-mouse-14616 thanks for the explanation. My use case was a bit different: suppose that max_parallelism of a workflow is set to 10, but 100 execution requests arrive. 10 executions will start immediately, while the remaining 90 workflows will be queued up. So, would these queued workflows survive FlytePropeller restarts? Do we create CRs for workflows immediately, even if they are waiting due to constraints like max_parallelism?
f
Yes, they will survive; you will not lose any workflows.