# ask-the-community
m
Hello! We have a workflow that fails with `Timeout in node`, but the task that fails gets marked as Unknown, so we can't click into it to check the logs etc., and Flyte (1.8) keeps tracking the task.
The task that fails has a 1h timeout, but it is not marked as failed or timed out or anything like that, and the duration it shows (11h now) keeps increasing (another execution is at around 420h now).
I would expect the task to be marked as failed, but it seems like Flyte loses track of the task.
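(Aside: a minimal sketch of how a 1h timeout like this is typically declared on a Flyte task via flytekit's `timeout` parameter; the task and workflow names below are hypothetical.)

```python
from datetime import timedelta

from flytekit import task, workflow


# Hypothetical task with the kind of 1h timeout discussed above. When the
# task exceeds `timeout`, Flyte is expected to abort it and mark the node
# as timed out rather than leaving it in UNKNOWN.
@task(timeout=timedelta(hours=1), retries=0)
def long_running_step(n: int) -> int:
    return n * 2


@workflow
def wf(n: int = 1) -> int:
    return long_running_step(n=n)
```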
s
Is it possible for you to check the propeller logs?
m
(attached: image.png, screenshots of the propeller logs)
@Samhita Alla these logs look quite relevant, and these too
s
Can you set `inject-finalizers` to true in the flyte propeller config? More info is available in this thread: https://discuss.flyte.org/t/2743027/hi-we-re-doing-some-performance-testing-and-when-we-start-a-#2409e7df-fda2-475e-a184-49e1be51e3ff
m
just did it, I'll see tomorrow if it fixes the issue
but also, the task is a Spark task, will this still work 🤔
s
Are you asking if timeouts work for Spark tasks?
m
If adding `inject-finalizers` will make Flyte mark the Spark task as failed instead of it getting stuck at `UNKNOWN`.
s
I'm not sure; let's check if it's working for you. If not, we can think about how to resolve the issue.
m
@Samhita Alla this seems to have fixed the issue! Thank you very much! I am also thinking maybe `inject-finalizers` should be enabled by default.
s
@Dan Rammer (hamersaw), what do you think?
m
It was a bit too early to say that it worked; now I see the same state in the UI, with these logs ``````
s
Could you share the updated plugin config, please?
m
sure, this is the relevant section of the flyte-propeller-config configmap:
```yaml
k8s.yaml: |
    plugins:
      k8s:
        inject-finalizers: true
```
s
I think it needs to be `inject-finalizer`, not `inject-finalizers`.
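(For reference, applying that correction to the configmap shared above would give something like this sketch:)

```yaml
# flyte-propeller-config, k8s plugin section, with the singular key name
# suggested above; everything else is as shared earlier in the thread.
k8s.yaml: |
    plugins:
      k8s:
        inject-finalizer: true
```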
d
@Mücahit I'm having some difficulty following here. It seems all of the propeller logs you have linked are for different workflows? Some are for `python-task`, some for the `athena` task, but you have mentioned Spark tasks. Are there issues with all of these? Is the timeout not working for any of them?
m
@Samhita Alla true.. I changed the config, but the timeout issue hasn't occurred again yet; it might happen once more this week
@Dan Rammer (hamersaw) it is one workflow with multiple tasks, `spark` and `python`, as you can see in the screenshot. The workflow gets marked as failed with the `Timeout in node` error message, but the task that times out, the `read_data` one (a Spark task), is not marked as failed and stays `UNKNOWN` indefinitely.
and the issue happened again, with the `inject-finalizer` configuration
d
@Mücahit so rereading this - there is a task that times out, the workflow fails with an error saying "Timeout in node", and the concern is that there is a task running in the background that Flyte lost track of?
Does the Spark CR get created in the k8s cluster? Can you verify it exists and is still around when the workflow fails?
I suspect there is never a k8s resource created, because otherwise the task status would be something like `QUEUED` or `RUNNING`; the `UNKNOWN` leads me to believe that the task never began executing, and the increasing timestamp in the UI does not accurately reflect what is actually happening. If Flyte reports that a node started but never reports it ending, that duration will continue to tick in the UI. There may be a bug where, on timeout, Flyte misses reporting the task ending, but there is actually nothing executing.
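(A minimal sketch of how one could check whether the Spark CR exists, using the kubernetes Python client. The CRD group/version/plural below assume the standard Spark operator that Flyte's Spark plugin targets, and the namespace is a hypothetical `<project>-<domain>` one; adjust both to your cluster.)

```python
from kubernetes import client, config

# Load credentials from the local kubeconfig; use load_incluster_config()
# instead when running inside the cluster.
config.load_kube_config()
api = client.CustomObjectsApi()

# List SparkApplication custom resources in the execution's namespace.
spark_apps = api.list_namespaced_custom_object(
    group="sparkoperator.k8s.io",
    version="v1beta2",
    namespace="flytesnacks-production",  # hypothetical: <project>-<domain>
    plural="sparkapplications",
)
for item in spark_apps.get("items", []):
    name = item["metadata"]["name"]
    state = item.get("status", {}).get("applicationState", {})
    print(name, state)
```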
l
I hit a similar issue. Our downstream had issues and was running slowly, so all the tasks (ShellTask) took longer than usual. After the downstream was fixed, rerunning the job worked. The UI timer increases indefinitely and seems to be a visual bug only. We do not have the `inject-finalizer` config. I believe it went from `Running -> TimingOut -> TimedOut -> Unknown`.
From the propeller logs (screenshot attached), after this it triggers deletion and marks the workflow as failed:
```
Change in node state detected from [Running] -> [NodePhaseTimingOut], (handler phase [Timedout])
Recording NodeEvent [node_id:"n1" execution_id:<project:"..." domain:"production" name:"f4fb0fb246163cd71000" > ] phase[UNDEFINED]
```
Concerns:
• Zero logs about what is happening are propagated to the UI, or about why `Timeout in node` is happening.
• The `flyte:propeller:all:node:failure_duration_ms` metric doesn't show that the executions failed at all; metrics show every execution of this workflow as successful.
It seems the `inject-finalizer` flag might resolve the issue above. Is there a reason why it is not enabled by default? I want to understand whether there are any downsides before I switch it on.