Edgar Trujillo
10/03/2022, 11:10 PM
A wf node was aborted due to "node timed out", but the node status is UNKNOWN, so the node duration continues to go up.
This node just called a SageMaker training job via the SDK with the following code:
estimator = Estimator(
    dummy_params
)
# Start the job (fit() blocks until the training job finishes unless wait=False is passed)
estimator.fit(inputs=data_channels)
job_name = estimator.latest_training_job.job_name
logging.info(f"Started sagemaker training job {job_name}")
return os.path.join(output_location, job_name, "output", "model.tar.gz")
SageMaker shows that the training job completed after 55 hours. A similar node was used in another wf and ran for 1 hour; that node execution shows success and cached the model artifact S3 URI.
It's happened a few times now, and I don't want to keep kicking off training jobs that won't be cached by Flyte.
Does anyone have an idea what could be causing this node timeout? Ketan (kumare3)
Dan Rammer (hamersaw)
10/04/2022, 12:26 PM
node-execution-deadline and node-active-deadline terminate nodes that have been executing or active (i.e. in queued + running states) for longer than 48h by default. You should increase these values accordingly. It is important to note that there is also a workflow-level configuration, workflow-active-deadline, that works similarly and defaults to 72h, so you may need to update that as well.
Edgar Trujillo
10/04/2022, 1:56 PM