# flytekit
e
Hi, I'm seeing a weird error with workflow executions. Specifically, I'm seeing a wf node that was aborted due to `node timed out`, but the node status is `UNKNOWN`, so the node duration continues to go up. This node just called a SageMaker training job via the SDK with the following code:
```python
import logging
import os

from sagemaker.estimator import Estimator

estimator = Estimator(
    dummy_params
)

# Start the training job (fit blocks until the job finishes by default)
estimator.fit(inputs=data_channels)
job_name = estimator.latest_training_job.job_name
logging.info(f"Started sagemaker training job {job_name}")

return os.path.join(output_location, job_name, "output", "model.tar.gz")
```
SageMaker is showing the training job completed after 55 hours. A similar node was used in another wf and ran for 1 hour, and that node execution shows success and cached the model artifact S3 URI. It's happened a few times now, and I don't want to keep kicking off training jobs that won't be cached by Flyte. Does anyone have an idea what could be causing this `Node Timeout`?
k
ohh no, this is the default node timeout 😞
cc @Dan Rammer (hamersaw) can we make the default infinite?
d
@Ketan (kumare3) certainly! @Edgar Trujillo As mentioned ^^^, FlytePropeller has configuration values for `node-execution-deadline` and `node-active-deadline`, which terminate nodes that have been executing or active (i.e. queued + running) for longer than 48h by default. You should increase these values accordingly. It is important to note that there is a workflow-level configuration, namely `workflow-active-deadline`, that works similarly and defaults to 72h, so you may need to update that as well.
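For reference, here is a minimal sketch of what raising those deadlines could look like in the FlytePropeller config. The nesting under `propeller.node-config.default-deadlines` and the 168h values are assumptions for illustration; confirm the exact layout and choose durations against your own deployment's configmap or Helm values.
```yaml
# Sketch of FlytePropeller deadline overrides (assumed layout; verify against your deployment)
propeller:
  node-config:
    default-deadlines:
      # Max time a node may spend executing before it is aborted (default 48h)
      node-execution-deadline: 168h
      # Max time a node may be active, i.e. queued + running (default 48h)
      node-active-deadline: 168h
      # Max time a whole workflow may remain active (default 72h)
      workflow-active-deadline: 168h
```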
e
Thanks @Ketan (kumare3) @Dan Rammer (hamersaw)!