# ask-the-community
r
Hi team.. I’m running into a “Timeout in node” error for a long-running workflow even though the config does have `inject-finalizer: true`. Any recommendations?
k
ohh you have hit the default timeout in node
can you set that to -1
@Dan Rammer (hamersaw) can we default this to unlimited?
r
will do.. thanks. I assumed the default was unlimited.. since the description said “set to max…”
also what is the default config?
d
@Rupsha Chaudhuri I know older deployments had a default value set for `node-execution-deadline` and/or `node-active-deadline` in the propeller configuration. However, we updated this so that they are defaulted to `0` (unlimited). Can you check your configuration on this? Hopefully we didn't miss anything.
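For reference, a minimal sketch of setting those deadlines explicitly in the propeller config. The `node-config` / `default-deadlines` nesting here is an assumption; check the propeller configuration reference for your version before applying.
```yaml
# Sketch only: key nesting is assumed, not confirmed for every release.
propeller:
  node-config:
    default-deadlines:
      node-execution-deadline: 0s   # 0 = no deadline
      node-active-deadline: 0s
```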
r
Is the default 0 or -1 if I specify it at the task level? I'll check the propeller configuration meanwhile
d
It should be `0`; ultimately, this is the code that determines node timeouts.
r
Checked our config.. we don’t have `node-execution-deadline` or `node-active-deadline` anywhere
d
Do you know what version you're running?
This is the PR that updated the unconfigured default configuration to `0` for all of the deadlines.
r
propeller: v1.1.42
d
ok thanks, looks like this change landed in 1.1.44, so you could either update or set the deadlines explicitly in the configuration.
r
I think bumping up is good.. that way all tasks benefit from it
thanks!
@Dan Rammer (hamersaw) can propeller be upgraded standalone? or is there a version compatibility I need to be mindful of with other components?
d
typically we make sure there is version compatibility between components in the same major Flyte release (going to be easier with the monorepo). Manually looking at the release notes for this two-version bump, you should not have any problems updating from 1.1.42 to 1.1.44.
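If you deploy with the flyte-core helm chart, bumping propeller alone might look roughly like this; the key path and repository shown are assumptions about that chart, so verify them against your values.yaml.
```yaml
# Assumed flyte-core values layout; adjust to your deployment.
flytepropeller:
  image:
    repository: cr.flyte.org/flyteorg/flytepropeller  # assumed default repository
    tag: v1.1.44
```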
l
Sorry for jumping in, we still observe this on Flyte 1.9. We first observed this over 3 months ago and added `inject-finalizer` a month ago. Usually there is an underlying problem that causes the tasks to run for very long and hit this, and it happens mostly to new users onboarding and developing their code. The state goes to unknown and they can't access the logging URL. We don’t have `node-execution-deadline` or `node-active-deadline` either. Did I configure it wrongly? The `default-env-vars` are working for us.
```yaml
configmap:
  k8s:
    plugins:
      k8s:
        default-env-vars:
          ....
        inject-finalizer: true
```
I attached a screenshot of one that happened <24 hours ago. For this user, retries is set to 0 with a 60-minute task timeout, but it still gets the node timeout.
```python
@task(
    retries=0,
    timeout=timedelta(minutes=60))
```
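For context, a self-contained sketch of such a task; the function name and body are made up, only the `retries`/`timeout` arguments mirror the user's settings above.
```python
from datetime import timedelta

from flytekit import task


# Illustrative task only; retries/timeout match the settings quoted above.
@task(retries=0, timeout=timedelta(minutes=60))
def long_running_step() -> str:
    # work that can exceed the 60-minute task timeout goes here
    return "done"
```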
cc: @Zi Yi Ewe @Krithika Sundararajan
d
@Lee Ning Jie Leon I do not understand the issue. The `timeout` configuration in the task decorator will make the node time out. Is this not expected?
l
The behaviour and the UI are different. When it is `Timeout in node`, the task state goes from `running` to `unknown`, becomes un-clickable, and the logs are inaccessible from the UI. The execution duration continues to run indefinitely in the UI. A task timeout, by contrast, ends with a `failed` state, with logs and the timeout duration stated. Checking further, I think the user removed the timeout in the recent version or reran an old workflow 🤔. Nevertheless, we already have the `inject-finalizer` and don't expect the unknown state and inaccessible logs. From a user's point of view, they have no idea what went wrong and it's hard to debug. Some users thought the pod was running indefinitely without ever timing out. That said, I'll ask the user to retry with a new version and see if it still happens.
```
[1/1] currentAttempt done. Last Error: USER::task execution timeout [1h0m0s] expired
```
```
Timeout in node
```
d
OK, it sounds like this is a UI issue with not handling node timeouts correctly, then? Would you mind creating an issue for this?
r
+1 on the unknown state.. I encountered it as well
k
Unknown is the first state; it will progress to queued, etc.
r
Yes.. but after the task timed out, it went back to "unknown". In my case I know it ran for 50+ hours because the data was generated
k
48 hours was the default timeout
r
Right.. I now know about the original issue. I think the question now is about the state of the node becoming "unknown" later with no access to logs. I understand if it says "failed" with the reason being "node timeout"
l
Raised it here. Took a look at the db as well; it seems like the execution is marked as `ABORTED` but the UI is showing unknown. Might just be a UI bug after all 😅