Edgar Trujillo
10/03/2022, 11:10 PM
A wf node was aborted due to "node timed out", but the node status is UNKNOWN, so the node duration continues to go up.
This node just calls a SageMaker training job via the SDK with the following code:
estimator = Estimator(
    dummy_params
)
# Start the job (fit() blocks until the training job finishes unless wait=False is passed)
estimator.fit(inputs=data_channels)
job_name = estimator.latest_training_job.job_name
logging.info(f"Started sagemaker training job {job_name}")
return os.path.join(output_location, job_name, "output", "model.tar.gz")
SageMaker shows that the training job completed after 55 hours. A similar node was used in another wf and ran for 1 hour; that node execution shows success and cached the model artifact S3 URI.
It's happened a few times now, and I don't want to keep kicking off training jobs that won't be cached by Flyte.
Does anyone have an idea what could be causing this node timeout?
Ketan (kumare3)
10/04/2022, 5:06 AM
Dan Rammer (hamersaw)
10/04/2022, 12:26 PM
The node-execution-deadline and node-active-deadline settings terminate nodes that are executing or active (i.e. queued + running states) for longer than 48h by default. You should increase these values accordingly. It is important to note that there is also a workflow-level configuration, namely workflow-active-deadline, that works similarly and defaults to 72h, so you may need to update that as well.
Edgar Trujillo
10/04/2022, 1:56 PM
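To illustrate the deadlines described in the reply above, here is a minimal sketch that compares the job durations mentioned in this thread (55 hours and 1 hour) against the quoted defaults of 48h and 72h. The helper name `exceeded_deadlines` and the constants are hypothetical and not part of Flyte. The sketch also assumes the node's runtime is dominated by the blocking `estimator.fit` call and that the deployment still uses the default values.

```python
from datetime import timedelta

# Defaults quoted in the thread; a real deployment may override them.
NODE_DEADLINE = timedelta(hours=48)      # node-execution-deadline / node-active-deadline
WORKFLOW_DEADLINE = timedelta(hours=72)  # workflow-active-deadline


def exceeded_deadlines(job_duration,
                       node_deadline=NODE_DEADLINE,
                       workflow_deadline=WORKFLOW_DEADLINE):
    """Return which deadlines a blocking job of this length would exceed.

    Hypothetical helper for illustration only; it is not a Flyte API.
    """
    exceeded = []
    if job_duration > node_deadline:
        exceeded.append("node")
    if job_duration > workflow_deadline:
        exceeded.append("workflow")
    return exceeded


# The 55-hour training job from the thread versus the 1-hour one.
print(exceeded_deadlines(timedelta(hours=55)))
print(exceeded_deadlines(timedelta(hours=1)))
```

Under these assumptions the 55-hour job passes the node deadline but not the workflow deadline, while the 1-hour job stays within both, which matches the behaviour reported at the top of the thread.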