blue-ice-67112
07/18/2023, 1:22 PM
Timeout in node
but the task that fails gets marked as Unknown, so we can't click into it and refer to the logs etc., and Flyte (1.8) keeps tracking the task

blue-ice-67112
07/18/2023, 1:23 PM

blue-ice-67112
07/18/2023, 1:23 PM

tall-lock-23197

blue-ice-67112
07/19/2023, 7:39 AM

blue-ice-67112
07/19/2023, 7:39 AM

blue-ice-67112
07/19/2023, 7:43 AM

tall-lock-23197
Could you set inject-finalizers to true in the flyte propeller config? More info available in the attached thread:
https://discuss.flyte.org/t/2743027/hi-we-re-doing-some-performance-testing-and-when-we-start-a-#2409e7df-fda2-475e-a184-49e1be51e3ff

blue-ice-67112
07/19/2023, 9:02 AM

blue-ice-67112
07/19/2023, 9:03 AM

tall-lock-23197

blue-ice-67112
07/19/2023, 9:20 AM
inject-finalizers will make Flyte mark the spark task as failed instead of it getting stuck at Unknown?

tall-lock-23197

blue-ice-67112
07/20/2023, 6:08 AM
inject-finalizers should be enabled by default

tall-lock-23197

blue-ice-67112
07/20/2023, 10:18 AM

tall-lock-23197

blue-ice-67112
07/21/2023, 12:43 PM
k8s.yaml: |
  plugins:
    k8s:
      inject-finalizers: true

tall-lock-23197
It's inject-finalizer but not inject-finalizers.
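For reference, a minimal sketch of the corrected snippet, assuming the same configmap layout as the config shared above; only the key name changes, per the correction:

k8s.yaml: |
  plugins:
    k8s:
      # singular key name, per the correction above; the plural form is not a recognized key
      inject-finalizer: true

With this flag on, propeller injects a finalizer on the Kubernetes resources it creates, so they are not deleted before propeller records their terminal state.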
hallowed-mouse-14616
07/24/2023, 1:48 PM
Some of these are for a python-task, some for an athena task, but you have mentioned spark tasks - are there issues with all of these? Is the timeout not working for everything?

blue-ice-67112
07/26/2023, 10:17 AM

blue-ice-67112
07/26/2023, 10:19 AM
spark and python, as you can see in the screenshot.
The workflow gets marked as failed with a Timeout in node error message, but the task that times out, the read_data one (spark task), is not marked as failed and stays as UNKNOWN indefinitely.

blue-ice-67112
07/27/2023, 6:47 AM

hallowed-mouse-14616
07/27/2023, 10:38 PM

hallowed-mouse-14616
07/27/2023, 10:39 PM

hallowed-mouse-14616
07/27/2023, 10:41 PM
A task that had started would be QUEUED or RUNNING; the UNKNOWN leads me to believe that the task never began executing, and the increasing duration in the UI does not accurately reflect what is actually happening. If Flyte reports that a node started but never reports the node ending, this duration will continue to tick in the UI. There may be a bug where, on timeout, Flyte misses the report for the task ending, but there is actually nothing executing.

broad-train-34581
08/03/2023, 5:50 AM
From the propeller logs (screenshot attached), the node goes Running -> TimingOut -> TimedOut -> Unknown, after which it triggers deletion and marks the workflow as failed.
Change in node state detected from [Running] -> [NodePhaseTimingOut], (handler phase [Timedout])
Recording NodeEvent [node_id:"n1" execution_id:<project:"..." domain:"production" name:"f4fb0fb246163cd71000" > ] phase[UNDEFINED]
Concerns:
• Zero logs are propagated to the UI about what is happening or why Timeout in node occurs
• Metrics in flyte:propeller:all:node:failure_duration_ms don't show that executions failed at all; they show all executions for this workflow as successful
It seems that the inject-finalizer flag might resolve the issue above. Is there a reason why it is not enabled by default? I want to understand whether there are any downsides before I switch it on.