Flyte #flyte-support

gorgeous-beach-23305

11/27/2023, 11:35 AM

Hi All, we have an LP that executes a batch workflow on a `CronSchedule` of 2 mins. The situation we sometimes see is that if Flyte is down for whatever reason, say some issue with Kubernetes, and the workflows are not able to execute, then the executions keep accumulating. When Flyte comes back or is healthy again, it tries to execute all of the backed-up executions, which puts stress on our DB / APIs. We thought of using fixed rate intervals, but decided against it because if one execution gets stuck for whatever reason, then the next execution will remain blocked, and our processing is time sensitive. Is there a setting to limit un-executed executions, or to drop executions if they are not run within a certain window of being created? Thanks!
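For illustration, the "drop executions that are too old" behavior asked about here can be approximated inside the workflow itself. This is only a sketch of a workaround, not a Flyte setting: it assumes the scheduled time is passed into the workflow (flytekit's `CronSchedule` has a `kickoff_time_input_arg` parameter for this), and `is_stale` and `MAX_AGE` are hypothetical names.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Assumption: a run older than one schedule period is no longer worth executing.
MAX_AGE = timedelta(minutes=2)


def is_stale(
    kickoff_time: datetime,
    now: Optional[datetime] = None,
    max_age: timedelta = MAX_AGE,
) -> bool:
    """Return True if the scheduled kickoff time is older than max_age."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now - kickoff_time > max_age


# Hypothetical use as the first step of the batch workflow:
# if is_stale(kickoff_time): skip the DB / API work and return early.
```

Backed-up executions would still be launched after an outage, but each one could exit before touching the DB or APIs.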

tall-lock-23197

11/28/2023, 8:35 AM

I'm not sure that option exists. @icy-agent-73298, would love to know if you have any ideas.

icy-agent-73298

11/28/2023, 6:31 PM

Unfortunately we don't have an option to turn off this catchup subsystem, but it should be easy to add this configuration. https://docs.flyte.org/en/latest/concepts/component_architecture/native_scheduler_architecture.html#catchupall-system It's expected behavior for the scheduler to schedule any runs it missed during its downtime. Currently, without an option to turn off the catchup system, you can disable the schedules that you don't want to execute after the system comes back up, and the scheduler will take care of not catching up on those schedules.
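To get a rough sense of how large the catch-up backlog can be, the missed fire times for a fixed 2-minute schedule can be enumerated as below. This is a back-of-the-envelope sketch, not the scheduler's actual logic: `missed_runs` is a hypothetical helper, and it assumes the first missed run falls exactly at the start of the outage.

```python
from datetime import datetime, timedelta
from typing import List


def missed_runs(
    down_from: datetime,
    up_at: datetime,
    interval: timedelta = timedelta(minutes=2),
) -> List[datetime]:
    """List the fire times in [down_from, up_at) for a fixed-interval schedule."""
    runs = []
    t = down_from
    while t < up_at:
        runs.append(t)
        t += interval
    return runs


# A one-hour outage with a 2-minute schedule leaves 30 runs to catch up on.
backlog = missed_runs(datetime(2023, 11, 27, 10, 0), datetime(2023, 11, 27, 11, 0))
print(len(backlog))
```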

icy-agent-73298

11/28/2023, 6:32 PM

Ref : https://docs.flyte.org/projects/flytectl/en/latest/gen/flytectl_update_launchplan.html

gorgeous-beach-23305

11/29/2023, 10:19 AM

Yeah, we tried archiving the LP from the command line using flytectl, but unfortunately even that was not working when the system was down. When it came back, by the time we got to de-activating the LP, it had already launched a bunch of pending executions which overwhelmed our DB.

icy-agent-73298

11/30/2023, 3:50 AM

I see. So you brought down all of the Flyte pods, not just the scheduler, and the proposed solution wouldn't have worked in that situation. It would only have been possible if you had delayed the scheduler pod rollout. Let me take an action item to add a configurable option for the catchup system.

icy-agent-73298

11/30/2023, 3:51 AM

If you don't mind, would you create an issue for it?

gorgeous-beach-23305

11/30/2023, 3:18 PM

I can 🙂

icy-agent-73298

11/30/2023, 9:26 PM

Thanks, please share the ticket on this thread so that we can prioritize it for the release with the right tags on it.

gorgeous-beach-23305

12/06/2023, 10:52 AM

Done.. https://github.com/flyteorg/flyte/issues/4538


gorgeous-beach-23305

01/11/2024, 11:27 AM

@tall-lock-23197 @icy-agent-73298 Just wanted to confirm the behavior of fixed rate intervals again. If we use that instead of the `CronSchedule`, with an interval of 2 mins, is the next execution launched after 2 mins only if the previous execution completes successfully, or is it also launched after 2 mins if the previous execution fails?

tall-lock-23197

01/11/2024, 12:16 PM

I think the job has to run irrespective of the status of the previous run.
