# ask-the-community
r
Hi team.. I’m running into a “Timeout in node” error for a long-running workflow even though the config does have `inject-finalizer: true`. Any recommendations?
k
ohh you have hit the default timeout in node
can you set that to -1
@Dan Rammer (hamersaw) can we default this to unlimited?
r
will do.. thanks. I assumed the default was unlimited.. since the description said “set to max…”
also what is the default config?
d
@Rupsha Chaudhuri I know older deployments had a default value set for `node-execution-deadline` and/or `node-active-deadline` in the propeller configuration. However, we updated this so that they are defaulted to `0` (unlimited). Can you check your configuration on this? Hopefully we didn't miss anything.
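For reference, a minimal sketch of setting those deadlines explicitly in the propeller config. The `node-config` / `default-deadlines` nesting here is an assumption; check the propeller configuration reference for your version before applying.
```yaml
# Sketch only: key nesting is assumed, not confirmed for every release.
propeller:
  node-config:
    default-deadlines:
      node-execution-deadline: 0s   # 0 = no deadline
      node-active-deadline: 0s
```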
r
Is the default 0 or -1 if I specify it at the task level? I'll check the propeller configuration meanwhile
d
It should be `0`; ultimately, this is the code that determines node timeouts.
r
Checked our config.. we don’t have `node-execution-deadline` or `node-active-deadline` anywhere
d
Do you know what version you're running?
This is the PR that updated the unconfigured default configuration to `0` for all of the deadlines.
r
propeller: v1.1.42
d
ok thanks, looks like this change landed in 1.1.44, so you could either update or set the deadlines explicitly in the configuration.
r
I think bumping up is good.. that way all tasks benefit from it
thanks!
@Dan Rammer (hamersaw) can propeller be upgraded standalone? or is there a version compatibility I need to be mindful of with other components?
d
typically we make sure there is version compatibility between components in the same major Flyte release (going to be easier with the monorepo). Manually looking at the release notes for this two-version bump, you should not have any problems updating from 1.1.42 to 1.1.44.
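If you deploy with the flyte-core helm chart, bumping propeller alone might look roughly like this; the key path and repository shown are assumptions about that chart, so verify them against your values.yaml.
```yaml
# Assumed flyte-core values layout; adjust to your deployment.
flytepropeller:
  image:
    repository: cr.flyte.org/flyteorg/flytepropeller  # assumed default repository
    tag: v1.1.44
```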
l
Sorry for jumping in, we still observe this on Flyte 1.9. We first observed this over 3 months ago and added `inject-finalizer` a month ago. Usually there is an underlying problem that causes the tasks to run for very long and hit this, and it happens mostly to new users onboarding and developing their code. The state goes to unknown and they can't access the logging URL. We don’t have `node-execution-deadline` or `node-active-deadline` either. Did I configure it wrongly? The `default-env-vars` are working for us.
```yaml
configmap:
  k8s:
    plugins:
      k8s:
        default-env-vars:
          ....
        inject-finalizer: true
```
I attached a screenshot of one that happened <24 hours ago. For this user, retries is set to 0 with a 60-minute task timeout, but it still gets the node timeout.
```python
@task(
    retries=0,
    timeout=timedelta(minutes=60))
```
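For context, a self-contained sketch of such a task; the function name and body are made up, only the `retries`/`timeout` arguments mirror the user's settings above.
```python
from datetime import timedelta

from flytekit import task


# Illustrative task only; retries/timeout match the settings quoted above.
@task(retries=0, timeout=timedelta(minutes=60))
def long_running_step() -> str:
    # work that can exceed the 60-minute task timeout goes here
    return "done"
```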
cc: @Zi Yi Ewe @Krithika Sundararajan
d
@Lee Ning Jie Leon I do not understand the issue. The `timeout` configuration in the task decorator will make the node time out. Is this not expected?
l
The behaviour and the UI are different. When it is `Timeout in node`, the task state goes from `running` to `unknown`, becomes un-clickable, and the logs are inaccessible from the UI. The execution duration continues to run indefinitely in the UI. A task timeout, by contrast, ends with a `failed` state, with logs and the timeout duration stated. Checking further, I think the user removed the timeout in the recent version or reran an old workflow 🤔. Nevertheless, we already have the `inject-finalizer` and don't expect the unknown state and inaccessible logs. From a user's point of view, they have no idea what went wrong and it's hard to debug. Some users thought the pod was running indefinitely without ever timing out. That said, I'll ask the user to retry with a new version and see if it still happens.
```
[1/1] currentAttempt done. Last Error: USER::task execution timeout [1h0m0s] expired
```
```
Timeout in node
```
d
OK, it sounds like this is a UI issue with not handling node timeouts correctly, then? Would you mind creating an issue for this?
r
+1 on the unknown state.. I encountered it as well
k
Unknown is the first state; it will progress to queued, etc.
r
Yes.. but after the task timed out, it went back to "unknown". In my case I know it ran for 50+ hours because the data was generated
k
48 hours was the default timeout
r
Right.. I now know about the original issue. I think the question now is about the state of the node becoming "unknown" later with no access to logs. I understand if it says "failed" with the reason being "node timeout"
l
Raised it here. Took a look at the db as well; it seems like the execution is marked as `ABORTED` but the UI is showing unknown. Might just be a UI bug after all 😅