Hi I m seeing a weird error with workflow executions Specifi Flyte #flytekit

bumpy-morning-40916

10/03/2022, 11:10 PM

Hi, I'm seeing a weird error with workflow executions. Specifically, a wf node was aborted due to "node timed out", but the node status is UNKNOWN, so the node duration continues to go up. This node just called a SageMaker training job via the SDK with the following code:

estimator = Estimator(
    dummy_params
)

# Start the job
estimator.fit(inputs=data_channels)
job_name = estimator.latest_training_job.job_name
logging.info(f"Started sagemaker training job {job_name}")

return os.path.join(output_location, job_name, "output", "model.tar.gz")

SageMaker shows the training job completed after 55 hours. A similar node was used in another wf and ran for 1 hour; that node execution shows success and cached the model artifact S3 URI. This has happened a few times now, and I don't want to keep kicking off training jobs that won't be cached by Flyte. Does anyone have an idea what could be causing this?

Node Timeout

freezing-airport-6809

10/04/2022, 5:06 AM

ohh no, this is the default timeout in node 😞

freezing-airport-6809

10/04/2022, 5:06 AM

cc @hallowed-mouse-14616 can we make the default infinite?

freezing-airport-6809

10/04/2022, 5:07 AM

https://docs.flyte.org/en/latest/deployment/cluster_config/scheduler_config.html#node-execution-deadline-config-duration

hallowed-mouse-14616

10/04/2022, 12:26 PM

@freezing-airport-6809 certainly! @bumpy-morning-40916 As mentioned ^^^, FlytePropeller has configuration values for node-execution-deadline and node-active-deadline, which terminate nodes that are executing or active (i.e. queued + running states) for longer than 48h by default. You should increase these values accordingly. Note that there is also a workflow-level configuration, workflow-active-deadline, that works similarly and defaults to 72h, so you may need to update that as well.
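To make the arithmetic concrete, here is a minimal Python sketch that checks an expected job runtime against the three deadlines named above. The default values are the ones quoted in this thread, not read from a live FlytePropeller config, and the helper name `exceeded_deadlines` is hypothetical. It also assumes zero queue time, which is a simplification, since the active deadlines cover queued + running states.

```python
from datetime import timedelta

# Defaults as quoted in this thread (assumption: your deployment may override them).
DEFAULT_DEADLINES = {
    "node-execution-deadline": timedelta(hours=48),
    "node-active-deadline": timedelta(hours=48),
    "workflow-active-deadline": timedelta(hours=72),
}


def exceeded_deadlines(expected_runtime, deadlines=DEFAULT_DEADLINES):
    """Return the sorted names of deadlines shorter than the expected runtime.

    Hypothetical helper for illustration; it ignores queue time.
    """
    return sorted(
        name for name, limit in deadlines.items() if expected_runtime > limit
    )


if __name__ == "__main__":
    # The 55-hour SageMaker job from this thread.
    print(exceeded_deadlines(timedelta(hours=55)))
```

For the 55-hour job this flags both node-level deadlines but not the workflow-level one, which matches the node being aborted while the training job itself kept running to completion.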

bumpy-morning-40916

10/04/2022, 1:56 PM

Thanks @freezing-airport-6809 @hallowed-mouse-14616!

👍 1
