Mücahit
07/18/2023, 1:22 PM
Timeout in node but the task that fails gets marked as Unknown, so we can't click on it to view the logs etc., and Flyte (1.8) keeps tracking the task.
Samhita Alla
Mücahit
07/19/2023, 7:39 AM
Samhita Alla
inject-finalizers to true in the flyte propeller config? More info available in the attached thread:
https://discuss.flyte.org/t/2743027/hi-we-re-doing-some-performance-testing-and-when-we-start-a-#2409e7df-fda2-475e-a184-49e1be51e3ff
Mücahit
07/19/2023, 9:02 AM
Samhita Alla
Mücahit
07/19/2023, 9:20 AM
inject-finalizers will make Flyte mark the spark task as failed instead of it getting stuck at Unknown
Samhita Alla
Mücahit
07/20/2023, 6:08 AM
inject-finalizers should be enabled by default
Samhita Alla
Mücahit
07/20/2023, 10:18 AM
Samhita Alla
Mücahit
07/21/2023, 12:43 PM
k8s.yaml: |
  plugins:
    k8s:
      inject-finalizers: true
Samhita Alla
inject-finalizer but not inject-finalizers.
Dan Rammer (hamersaw)
07/24/2023, 1:48 PM
python-task, some for athena task, but you have mentioned spark tasks - are there issues with all of these? Is the timeout not working for everything?
Mücahit
07/26/2023, 10:17 AM
spark and python as you can see in the screenshot.
The workflow gets marked as failed with the Timeout in node error message, but the task that times out, the read_data one (spark-task), is not marked as failed and stays as UNKNOWN indefinitely.
Dan Rammer (hamersaw)
07/27/2023, 10:38 PM
QUEUED or RUNNING, the UNKNOWN leads me to believe that the task never began executing, and the increasing timestamp in the UI does not accurately reflect what is actually happening. If Flyte reports that a node started but doesn't have a node ending, this duration will continue to tick in the UI. There may be a bug where on timeout Flyte misses a report for the task ending, but there is actually nothing executing.
Lee Ning Jie Leon
08/03/2023, 5:50 AM
Running->TimingOut->TimedOut->Unknown
From the propeller logs; a screenshot is attached. After this, it triggers deletion and marks the workflow as failed.
Change in node state detected from [Running] -> [NodePhaseTimingOut], (handler phase [Timedout])
Recording NodeEvent [node_id:"n1" execution_id:<project:"..." domain:"production" name:"f4fb0fb246163cd71000" > ] phase[UNDEFINED]
Concerns
• Zero logs are propagated to the UI about what is happening or why Timeout in node is happening
• Metrics don't show in flyte:propeller:all:node:failure_duration_ms that executions failed at all. Metrics show all executions for this workflow as successful.
It seems that the inject-finalizer flag might resolve the issue above. Is there a reason why this is not enabled by default? I want to understand if there are any downsides to it before I switch it on.
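For reference, a minimal sketch of the setting under discussion, assuming the same k8s.yaml layout as the snippet posted earlier in the thread and the singular key name inject-finalizer that Samhita points out. The surrounding ConfigMap or Helm values structure is not shown and will depend on the deployment. Whether this resolves the UNKNOWN state is not confirmed anywhere in the thread.

```yaml
# Sketch only: mirrors the snippet posted on 07/21, with the key spelled
# inject-finalizer (singular) instead of inject-finalizers.
# Where this block lives (ConfigMap, Helm values) is deployment-specific.
k8s.yaml: |
  plugins:
    k8s:
      # Ask the k8s plugin to add a finalizer to the resources it creates.
      inject-finalizer: true
```

The new value most likely only takes effect once the propeller pod has been restarted and has reloaded its config.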