SageMaker shows the training job completed after 55 hours. A similar node was used in another workflow and ran for 1 hour; that node execution shows success and cached the model artifact S3 URI.
It's happened a few times now, and I don't want to keep kicking off training jobs that won't be cached by Flyte.
Does anyone have an idea what could be causing this Node Timeout?
freezing-airport-6809
10/04/2022, 5:06 AM
ohh no, this is the default timeout in node 😞
freezing-airport-6809
10/04/2022, 5:06 AM
cc @hallowed-mouse-14616 can we make the default infinite?
@freezing-airport-6809 certainly!
@bumpy-morning-40916 As mentioned ^^^ FlytePropeller has configuration values for node-execution-deadline and node-active-deadline, which terminate nodes that have been executing or active (i.e. in queued + running states) for longer than 48h by default. You should increase these values accordingly. Note that there is also a workflow-level configuration, workflow-active-deadline, that works similarly and defaults to 72h, so you may need to update that as well.
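To make the arithmetic concrete, here is a rough sketch in plain Python (it does not call Flyte) that checks a run length against the defaults quoted above. The helper name exceeded_deadlines and the dictionary are hypothetical, made up for illustration; only the three deadline names and the 48h/72h values come from this thread, so check them against your own deployment.

```python
from datetime import timedelta

# Defaults as quoted in this thread; verify against your own FlytePropeller config.
DEFAULT_DEADLINES = {
    "node-execution-deadline": timedelta(hours=48),
    "node-active-deadline": timedelta(hours=48),
    "workflow-active-deadline": timedelta(hours=72),
}


def exceeded_deadlines(runtime, deadlines=DEFAULT_DEADLINES):
    """Return the names of the deadlines that a run of length `runtime` exceeds.

    Hypothetical helper for illustration only. Active time also counts
    queued time, so treat `runtime` as a lower bound for the active deadlines.
    """
    return [name for name, limit in deadlines.items() if runtime > limit]


# The 55-hour training job passes both node-level deadlines.
print(exceeded_deadlines(timedelta(hours=55)))
# → ['node-execution-deadline', 'node-active-deadline']

# The 1-hour job in the other workflow stays within every deadline.
print(exceeded_deadlines(timedelta(hours=1)))
# → []
```

That matches what was reported: the 55h job outlives the 48h node deadlines, so the node is terminated even though SageMaker finishes the job, while the 1h run succeeds and gets cached.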