Hi All, We have a LP that executes a batch workflo...
# ask-the-community
n
Hi all, we have an LP that executes a batch workflow on a CronSchedule of 2 mins. Sometimes, if Flyte is down for whatever reason (say, some issue with Kubernetes) and the workflows are not able to execute, the executions keep accumulating. When Flyte comes back and is healthy, it tries to run all of the backed-up executions, which puts stress on our DB / APIs. We thought of using fixed rate intervals, but decided against it because if one execution gets stuck for whatever reason, the next execution will remain blocked, and our processing is time sensitive. Is there a setting to limit un-executed executions, or to drop executions if they are not run within a certain window of being created? Thanks!
s
I'm not sure that option exists. @Prafulla Mahindrakar, would love to know if you have any ideas.
p
Unfortunately we don't have an option to turn off this catch-up subsystem, but it should be easy to add this configuration: https://docs.flyte.org/en/latest/concepts/component_architecture/native_scheduler_architecture.html#catchupall-system It's expected behavior for the scheduler to schedule any runs it missed during its downtime. Currently, without an option to turn off the catch-up system, you can disable the schedules you don't want to execute after the system comes back up, and the scheduler will not catch up on those schedules.
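Until such an option exists, one possible workaround is to let each execution check how old its scheduled time is and exit early when it is stale. The sketch below is only an illustration, not a Flyte feature: it assumes the scheduled kickoff time is passed into the workflow as an input (for example through `kickoff_time_input_arg` on the `CronSchedule`), and `MAX_AGE`, `should_skip`, and `missed_runs` are hypothetical names. Note that this does not stop the catch-up executions from being created; it only keeps them from doing the expensive work.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical threshold: skip runs whose scheduled time is older than this.
MAX_AGE = timedelta(minutes=4)


def should_skip(kickoff_time: datetime, now: datetime,
                max_age: timedelta = MAX_AGE) -> bool:
    """Return True when the run was scheduled too long ago to be worth processing."""
    return now - kickoff_time > max_age


def missed_runs(downtime: timedelta,
                interval: timedelta = timedelta(minutes=2)) -> int:
    """Rough count of schedule ticks that accumulate while the scheduler is down."""
    return downtime // interval


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    # A run scheduled 30 minutes ago is treated as stale and skipped.
    print(should_skip(now - timedelta(minutes=30), now))
    # One hour of downtime on a 2-minute schedule leaves about 30 runs to catch up.
    print(missed_runs(timedelta(hours=1)))
```

In a workflow, the first task could call `should_skip` with the kickoff time and return immediately when it is true, so a backlog of caught-up executions finishes quickly without hitting the DB / APIs.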
n
Yeah, we tried archiving the LP from the command line using flytectl, but unfortunately even that was not working when the system was down. When it came back, by the time we got to de-activating the LP, it had already launched a bunch of pending executions which overwhelmed our DB.
p
I see. So you brought down all of the Flyte pods, not just the scheduler, and the proposed solution wouldn't have worked in that situation. It would only have been possible if you had delayed the scheduler pod rollout. Let me take an action item to add a configurable option for the catch-up system.
If you don't mind, would you create an issue for it?
n
I can 🙂
p
Thanks. Please share the ticket on this thread so that we can prioritize it for the release with the right tags on it.
n
@Samhita Alla @Prafulla Mahindrakar Just wanted to confirm the behavior of fixed rate intervals again. If we use that instead of the CronSchedule, with an interval of 2 mins, is the next execution launched after 2 mins only if the previous execution completes successfully, or is it also launched after 2 minutes if the previous execution fails?
s
I think the job will run irrespective of the status of the previous run.