#ask-ai

Mücahit

03/20/2024, 2:22 PM
@Glime I would like to add a new feature to Flyte for retrying nodes after a given delay, such as 10 minutes. If a node is eligible for retry, it should be executed after 10 minutes, not immediately. A sleep/delay within the flytepropeller Go process is not sufficient, because we have to store the next execution datetime somewhere in case the flytepropeller pod dies.

Glime

03/20/2024, 2:23 PM
To implement a feature for retrying nodes with a given delay, such as 10 minutes, in Flyte, you would need to consider a few key aspects based on the context provided:

1. *Storing the Next Execution Datetime*: Since you mentioned the need to store the next execution datetime somewhere in case the FlytePropeller pod dies, you would likely need to persist this information in durable storage that FlytePropeller can access. This could be achieved by extending Flyte's underlying data model to include a field for the next execution datetime for a task or node that is pending retry. This information could be stored in Flyte's metadata database, which is accessible to FlytePropeller upon restart.

2. *Modifying FlytePropeller Logic*: The FlytePropeller component, which is responsible for executing tasks and workflows, would need to be modified to check if a node is eligible for retry and if the current time is past the stored next execution datetime before attempting to execute it again. This logic would involve:
• Checking the eligibility for retry based on the existing retry policies and the new delay requirement.
• Reading the next execution datetime from the persistent storage.
• Comparing the current time with the stored datetime to decide whether to proceed with the retry.

3. *Configuration for Retry Delay*: You would need to introduce a new configuration option that allows specifying the delay duration for retries. This configuration could be part of the task or workflow specifications, allowing users to define different delays for different tasks or workflows.

4. *Handling FlytePropeller Pod Restarts*: Since there's a concern about the FlytePropeller pod dying, ensuring that the retry logic is resilient to such failures is crucial. By storing the next execution datetime in persistent storage, you ensure that the retry state is not lost. Upon restart, FlytePropeller would need to reload pending retries and their execution datetimes to continue processing them accordingly.

5. *Updating Documentation and User Guides*: Once the feature is implemented, updating the Flyte documentation and user guides to explain how to use the new retry delay feature would be important for adoption.

Given the complexity of this feature, it would also be beneficial to engage with the Flyte community through discussions on the Flyte GitHub repository or the Flyte Slack channel to gather feedback and possibly contributions. This approach leverages the information provided in the context about retry behaviors, configuration, and the architecture of Flyte, including the use of FlytePropeller and the metadata database.

---
This is an AI-generated response, take its recommendations with caution. Please help us improve the model accuracy by giving this response a thumbs up/thumbs down reaction.

Ketan (kumare3)

03/20/2024, 2:30 PM
This should be easy; please do not follow this bot. In fact, the failure time is recorded in etcd, Propeller's state store. The next execution should be the last failure time plus the expected retry duration. I am in favor of this system. Happy to make a ticket and drop details in there if you'd implement it.

Mücahit

03/20/2024, 2:48 PM
This ticket is very relevant: https://github.com/flyteorg/flyte/issues/2333. It would be great to have some pointers; I'm a bit lost in the codebase.

Ketan (kumare3)

03/20/2024, 8:10 PM
will do at some point today