# ask-the-community
s
#ask-the-community I have long-running Flyte workflows, and in the Flyte UI I see a timeout on the node. Is there a configuration in Flyte that tells it to let the job run indefinitely until it completes? I am dealing with ML workflows that can run up to 7 days.
d
@Sujith Samuel there are indeed configurable execution deadlines in the propeller config. `0s` means infinite for `node-active-deadline`, `node-execution-deadline`, and `workflow-active-deadline`.
s
We had 0s and it got cancelled after 48 hours.
d
Then it's not set correctly. What does the propeller configmap look like?
@Sujith Samuel just wanted to circle back here. Did you get this resolved?
s
We have set it to 168h and are waiting for some jobs to cross that mark.
`0s` definitely did not work; the job got cancelled after 48 hours.
d
OK, `0s` should work. Can I see what your propeller config looks like? What does the workflow look like? What was the exact error message? I know we've run into scenarios before where the `workflow-active-deadline` was exceeded and node executions were automatically aborted. If this is broken, I would be happy to look into it.
s
Hello @Dan Rammer (hamersaw), my propeller config is as below:
propeller:
  default-deadlines:
    node-active-deadline: 168h
    node-execution-deadline: 168h
    workflow-active-deadline: 168h
Even with the above, the node is cancelled after 48h.
d
Yeah, so 48h is the default set in the Flyte repo (in the Helm charts). I know I recently updated this to `0s` (not going to look up the PR). I don't think this configuration is being picked up. Let me look into it real quick.
s
I can set up a call and show you the workflow too, if needed. No configuration value seems to work: the default, 7d, and 168h all fail.
d
I think you may be missing a `node-config` level in the hierarchy.
propeller:
  node-config:
    default-deadlines:
      node-active-deadline: 168h
      node-execution-deadline: 168h
      workflow-active-deadline: 168h
Here is the issue. Starting from the top-level `propeller` config, `default-deadlines` should be set under `node-config`.
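Earlier in the thread, `0s` was described as meaning "infinite" for all three deadlines. Assuming that holds once the keys sit at the correct level, a sketch for workflows that should never hit a deadline would combine the `node-config` nesting above with `0s` values. This variant is an assumption: the thread only confirms the corrected nesting with 168h.

```yaml
propeller:
  node-config:
    default-deadlines:
      # Assumption: 0s disables the deadline, as stated earlier in the thread;
      # this 0s variant was not itself verified here.
      node-active-deadline: 0s
      node-execution-deadline: 0s
      workflow-active-deadline: 0s
```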
s
I will try this and let you know. Thanks again for your support.
d
No problem! Hope this fixes it. We have discussed supporting `StrictMode` when parsing configuration, so that Flyte components error out if the YAML config does not exactly match what they expect. That would mitigate these kinds of issues, which we see more often than we would like. However, it introduces some issues with the single binary, etc., because all components read from one unified configuration. It might be worth addressing this in more depth.
s
@Dan Rammer (hamersaw) the change worked great. Thanks a lot for your support.
I got a report from a user that they were able to run a 5-day workflow, so this is good.
d
@Sujith Samuel, that's great! Glad we could get this figured out 😄