acoustic-carpenter-78188
07/31/2023, 7:57 PM
WorkflowNodes. Basically, a launchplan is executed by FlytePropeller sending an execution request to FlyteAdmin, which then starts the launchplan, and FlytePropeller stores the execution ID in the WorkflowNode state. At each iteration, FlytePropeller checks the status of the FlyteWorkflow CR represented by the execution ID and updates the WorkflowNode state accordingly.
What is happening in the issue linked below is that FlyteAdmin is failing to start the launchplan. FlytePropeller detects this failure, but in doing so it keeps the proposed execution ID in the WorkflowNode state (here) and transitions the node to a failed state. When FlytePropeller then sends an event reporting this state to FlyteAdmin, FlyteAdmin checks whether the execution ID exists (here). Since FlyteAdmin failed to start the launchplan, the execution ID does not exist. That check is what produces the "Workflow does not exist" error that we see. Ultimately, FlytePropeller proceeds with aborting the WorkflowNode, which is entirely unnecessary.
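The failure sequence above can be sketched as a toy model. This is not FlytePropeller or FlyteAdmin code: `FakeAdmin`, `WorkflowNodeState`, `start_launchplan_buggy`, and the exception names are all hypothetical stand-ins, and Python is used only for brevity.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


class UserError(Exception):
    """Hypothetical: admin rejected the launch (e.g. invalid type interface)."""


class ExecutionNotFound(Exception):
    """Hypothetical: admin has no record of the referenced execution ID."""


@dataclass
class FakeAdmin:
    # Stand-in for FlyteAdmin: tracks which execution IDs were actually started.
    executions: Set[str] = field(default_factory=set)

    def create_execution(self, exec_id: str, valid_interface: bool) -> None:
        if not valid_interface:
            raise UserError("invalid type interface")
        self.executions.add(exec_id)

    def record_event(self, phase: str, exec_id: Optional[str]) -> None:
        # Any execution ID referenced by an event must already exist.
        if exec_id is not None and exec_id not in self.executions:
            raise ExecutionNotFound("Workflow does not exist")


@dataclass
class WorkflowNodeState:
    phase: str = "QUEUED"
    exec_id: Optional[str] = None


def start_launchplan_buggy(admin, state, exec_id, valid_interface):
    try:
        admin.create_execution(exec_id, valid_interface)
        state.phase = "RUNNING"
    except UserError:
        state.phase = "FAILED"
    # The behavior described in the thread: the proposed ID is kept
    # in the node state even though the launch never started.
    state.exec_id = exec_id
    return state


admin = FakeAdmin()
state = start_launchplan_buggy(
    admin, WorkflowNodeState(), "child-exec", valid_interface=False
)
try:
    admin.record_event(state.phase, state.exec_id)
    event_error = None
except ExecutionNotFound as err:
    event_error = str(err)

print(state.phase, event_error)  # FAILED Workflow does not exist
```

The node is correctly marked failed, but the event is rejected because it still references an execution that was never created.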
To fix this, there are two possible solutions:
(1) If a launchplan fails to start because of a user error (e.g., an invalid type interface), we do not set the execution ID on the WorkflowNode state, because the execution was never started. Of course, this means that we trust FlyteAdmin to report user errors only when the launchplan was not able to execute; I think this is reasonable. This is implemented in this PR.
(2) Allow FlyteAdmin's check for the existence of an execution ID to fail for events that report a failed state.
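To make the difference between the two options concrete, here is a minimal sketch under the same toy assumptions. The names `start_launchplan`, `record_event`, and the `tolerate_missing_on_failure` flag are hypothetical and do not come from the PR or from either codebase.

```python
from typing import Optional, Set, Tuple


class UserError(Exception):
    """Hypothetical: admin rejected the launch (e.g. invalid type interface)."""


def create_execution(known: Set[str], exec_id: str, valid_interface: bool) -> None:
    # Stand-in for the admin call that starts a launchplan.
    if not valid_interface:
        raise UserError("invalid type interface")
    known.add(exec_id)


def start_launchplan(
    known: Set[str], exec_id: str, valid_interface: bool
) -> Tuple[str, Optional[str]]:
    """Option (1): keep the execution ID only if the launch actually started."""
    try:
        create_execution(known, exec_id, valid_interface)
    except UserError:
        return "FAILED", None  # no ID stored, so later events reference nothing
    return "RUNNING", exec_id


def record_event(
    known: Set[str],
    phase: str,
    exec_id: Optional[str],
    tolerate_missing_on_failure: bool = False,
) -> bool:
    """Return True if the event is accepted.

    Option (2) is modeled by the flag: a missing execution ID is
    tolerated when the event reports a failed state.
    """
    if exec_id is None or exec_id in known:
        return True
    return tolerate_missing_on_failure and phase == "FAILED"


known: Set[str] = set()

# Option (1): the propeller side drops the ID, so the strict check passes.
phase, exec_id = start_launchplan(known, "child-exec", valid_interface=False)
option1_accepted = record_event(known, phase, exec_id)

# Option (2): the ID is still sent, but the admin side relaxes the check.
option2_accepted = record_event(
    known, "FAILED", "child-exec", tolerate_missing_on_failure=True
)

# Neither fix applied: the event is rejected.
unfixed_accepted = record_event(known, "FAILED", "child-exec")

print(option1_accepted, option2_accepted, unfixed_accepted)
```

Option (1) keeps the admin-side check strict and changes only what is stored in the node state. Option (2) leaves the node state alone and loosens validation for failure events.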
Tracking Issue
unionai/cloud#4172
Follow-up issue
NA
flyteorg/flytepropeller