# ask-the-community
m
Hello! We have a workflow that fails with `Timeout in node`, but the task that fails gets marked as Unknown, so we can't click into it to check the logs etc., and Flyte (1.8) keeps tracking the task.
The task that fails has a 1h timeout, but it is not marked as failed or timed out or anything like that, and the duration it shows (11h now) keeps increasing (another execution is at around 420h now).
I would expect the task to be marked as failed, but it seems like Flyte loses track of the task.
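(Aside: a minimal sketch of how a 1h timeout like this is typically declared on a Flyte task via flytekit's `timeout` parameter; the task and workflow names below are hypothetical.)

```python
from datetime import timedelta

from flytekit import task, workflow


# Hypothetical task with the kind of 1h timeout discussed above. When the
# task exceeds `timeout`, Flyte is expected to abort it and mark the node
# as timed out rather than leaving it in UNKNOWN.
@task(timeout=timedelta(hours=1), retries=0)
def long_running_step(n: int) -> int:
    return n * 2


@workflow
def wf(n: int = 1) -> int:
    return long_running_step(n=n)
```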
s
Is it possible for you to check the propeller logs?
m
(attached: image.png, screenshots of the propeller logs)
@Samhita Alla these logs look quite relevant, and these too
s
Can you set `inject-finalizers` to true in the flyte propeller config? More info is available in this thread: https://discuss.flyte.org/t/2743027/hi-we-re-doing-some-performance-testing-and-when-we-start-a-#2409e7df-fda2-475e-a184-49e1be51e3ff
m
just did it, I'll see tomorrow if it fixes the issue
but also, the task is a Spark task, will this still work 🤔
s
Are you asking if timeouts work for Spark tasks?
m
If adding `inject-finalizers` will make Flyte mark the Spark task as failed instead of it getting stuck at `UNKNOWN`.
s
I'm not sure; let's check if it's working for you. If not, we can think about how to resolve the issue.
m
@Samhita Alla this seems to have fixed the issue! Thank you very much! I am also thinking maybe `inject-finalizers` should be enabled by default.
s
@Dan Rammer (hamersaw), what do you think?
m
It was a bit too early to say that it worked; now I see the same state in the UI, with these logs ``````
s
Could you share the updated plugin config, please?
m
sure, this is the relevant section of the flyte-propeller-config configmap:
```yaml
k8s.yaml: |
    plugins:
      k8s:
        inject-finalizers: true
```
s
I think it needs to be `inject-finalizer`, not `inject-finalizers`.
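(For reference, applying that correction to the configmap shared above would give something like this sketch:)

```yaml
# flyte-propeller-config, k8s plugin section, with the singular key name
# suggested above; everything else is as shared earlier in the thread.
k8s.yaml: |
    plugins:
      k8s:
        inject-finalizer: true
```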
d
@Mücahit I'm having some difficulty following here. It seems all of the propeller logs you have linked are for different workflows? Some are for `python-task`, some for the `athena` task, but you have mentioned Spark tasks. Are there issues with all of these? Is the timeout not working for any of them?
m
@Samhita Alla true.. I changed the config, but the timeout issue hasn't occurred again yet; it might happen once more this week
@Dan Rammer (hamersaw) it is one workflow with multiple tasks, `spark` and `python`, as you can see in the screenshot. The workflow gets marked as failed with the `Timeout in node` error message, but the task that times out, the `read_data` one (a Spark task), is not marked as failed and stays `UNKNOWN` indefinitely.
and the issue happened again, with the `inject-finalizer` configuration
d
@Mücahit so rereading this - there is a task that times out, the workflow fails with an error saying "Timeout in node", and the concern is that there is a task running in the background that Flyte lost track of?
Does the Spark CR get created in the k8s cluster? Can you verify it exists and is still around when the workflow fails?
I suspect there is never a k8s resource created, because otherwise the task status would be something like `QUEUED` or `RUNNING`; the `UNKNOWN` leads me to believe that the task never began executing, and the increasing timestamp in the UI does not accurately reflect what is actually happening. If Flyte reports that a node started but never reports it ending, that duration will continue to tick in the UI. There may be a bug where, on timeout, Flyte misses reporting the task ending, but there is actually nothing executing.
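(A minimal sketch of how one could check whether the Spark CR exists, using the kubernetes Python client. The CRD group/version/plural below assume the standard Spark operator that Flyte's Spark plugin targets, and the namespace is a hypothetical `<project>-<domain>` one; adjust both to your cluster.)

```python
from kubernetes import client, config

# Load credentials from the local kubeconfig; use load_incluster_config()
# instead when running inside the cluster.
config.load_kube_config()
api = client.CustomObjectsApi()

# List SparkApplication custom resources in the execution's namespace.
spark_apps = api.list_namespaced_custom_object(
    group="sparkoperator.k8s.io",
    version="v1beta2",
    namespace="flytesnacks-production",  # hypothetical: <project>-<domain>
    plural="sparkapplications",
)
for item in spark_apps.get("items", []):
    name = item["metadata"]["name"]
    state = item.get("status", {}).get("applicationState", {})
    print(name, state)
```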
l
I hit a similar issue. Our downstream had issues and was running slowly, so all the tasks (ShellTask) took longer than usual. After the downstream was fixed, rerunning the job worked. The UI timer increases indefinitely and seems to be a visual bug only. We do not have the `inject-finalizer` config. I believe it went from `Running -> TimingOut -> TimedOut -> Unknown`.
From the propeller logs (screenshot attached), after this it triggers deletion and marks the workflow as failed:
```
Change in node state detected from [Running] -> [NodePhaseTimingOut], (handler phase [Timedout])
Recording NodeEvent [node_id:"n1" execution_id:<project:"..." domain:"production" name:"f4fb0fb246163cd71000" > ] phase[UNDEFINED]
```
Concerns:
• Zero logs about what is happening are propagated to the UI, or about why `Timeout in node` is happening.
• The `flyte:propeller:all:node:failure_duration_ms` metric doesn't show that the executions failed at all; metrics show every execution of this workflow as successful.
It seems the `inject-finalizer` flag might resolve the issue above. Is there a reason why it is not enabled by default? I want to understand whether there are any downsides before I switch it on.