# flyte-support
p
I'm running into an issue with the PyTorch plugin. When a job fails instead of printing out the error, I get something like this instead:
Workflow[flytesnacks:development:.flytegen.my.run] failed. RuntimeExecutionError: max number of system retry attempts [11/10] exhausted. Last known status message: failed at Node[myrun]. RuntimeExecutionError: failed during plugin execution, caused by: failed to read error file @[s3://https://s3-us-east-1.amazonaws.com/my-bucket/metadata/.../mrun/data/0/error-cb1d670ac8cc4eebbf475919a72f6b47.pb]: Conf container:my-bucket != Passed Container:https:. Dynamic loading is disabled: not found
I can look at the Grafana logs to see the error just fine, but it's still an issue. The task node appears as `Aborted`, while the execution status itself remains `Failed`.
This only happens with the PyTorch plugin; error handling works fine with the Ray plugin and with regular tasks.
@cool-lifeguard-49380 have you seen something like this before?
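In case it's useful, here's a quick sketch of what Go's `net/url` makes of that reference (just an illustration with a shortened path, and I'm assuming the storage layer splits the reference in roughly this way), which would explain the `Conf container:my-bucket != Passed Container:https:` part:

```go
package main

import (
	"fmt"
	"net/url"
)

func main() {
	// The malformed reference from the error message (path shortened for illustration).
	ref := "s3://https://s3-us-east-1.amazonaws.com/my-bucket/metadata/error.pb"

	u, err := url.Parse(ref)
	if err != nil {
		panic(err)
	}

	// The authority between "s3://" and the next "/" is the literal string "https:",
	// so anything that treats the host as the bucket sees "https:" instead of "my-bucket".
	fmt.Println(u.Scheme) // s3
	fmt.Println(u.Host)   // https:
	fmt.Println(u.Path)   // //s3-us-east-1.amazonaws.com/my-bucket/metadata/error.pb
}
```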
c
Just to confirm, you have the option "Error handling for distributed PyTorch tasks" here activated, correct?
What helm chart version do you use?
p
I'm using flyte-binary v1.15.1, and yes I have that value activated
c
It very much reminds me of this fix I had to do in our stow fork for gcs 🤔
p
`configurations.inline.plugins.k8s.enable-distributed-error-aggregation`
c
I have a 4-hour train ride on Sunday; I’ll try to reproduce this.
I unfortunately don’t have access to a Flyte deployment on S3, but hopefully the same happens with MinIO.
When you turn `configurations.inline.plugins.k8s.enable-distributed-error-aggregation` off, the error will definitely go away, but I understand you might not want to turn it off.
I’ll get back to you on Sunday.
p
Sounds good, thanks for the help Fabio!
c
Hey @purple-father-70173, I tried reproducing the issue with my local dev setup using MinIO (as I said, I unfortunately don't have access to an AWS deployment). In this setup I can't reproduce the issue.
This is the code location where your `failed to read error file` error is returned.
In your error message I noticed that the bucket uri `s3://https://s3-us-east-1.amazonaws.com/my-bucket/metadata/` has a duplicate protocol in `s3://https://`. This is very similar to the error I fixed for gcs in the PR mentioned above. The TL;DR of that PR was:
• When using our stow fork to make an `ls` request on a `gs://` bucket in order to find the error files from the workers, the listed uris didn't have the prefix `gs://` but instead `google://storage.googleapis.com/download/storage/v1/b/`.
• When then using stow to read one of the listed error files with the faulty `google://storage.googleapis.com/...` prefix, stow didn't find the object.
• The fix in the PR was that the uris returned from the List function also carry the expected `gs://` prefix.
I suspect that something similar is happening here: we make a list request using stow on an `s3://` bucket, the response somehow contains uris with an `https://` protocol, and seemingly we end up with both. I suspect this is happening here or here in stow, where the urls are constructed, but unfortunately in the setup with MinIO I have available I never step into those branches, because for MinIO `container.customEndpoint` is `"http://localhost:30084"`.
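To make the suspicion concrete, the kind of normalization I mean would look roughly like this. This is just a sketch with a made-up helper name and a shortened key, not the actual code in our stow fork; the gcs PR did the equivalent for `gs://`:

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// normalizeS3URI is a hypothetical helper: it rewrites a path-style
// https://<endpoint>/<bucket>/<key> url, as a List call might return it,
// back into the s3://<bucket>/<key> form the error aggregation expects.
// The real fix would live where the urls are constructed in stow.
func normalizeS3URI(raw, bucket string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", err
	}
	if u.Scheme == "s3" {
		// Already in the expected form, nothing to do.
		return raw, nil
	}
	key := strings.TrimPrefix(u.Path, "/")
	key = strings.TrimPrefix(key, bucket+"/")
	return fmt.Sprintf("s3://%s/%s", bucket, key), nil
}

func main() {
	listed := "https://s3-us-east-1.amazonaws.com/my-bucket/metadata/data/0/error.pb"
	fixed, err := normalizeS3URI(listed, "my-bucket")
	if err != nil {
		panic(err)
	}
	fmt.Println(fixed) // s3://my-bucket/metadata/data/0/error.pb
}
```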
Do you happen to have a setup where you can run flytepropeller locally with a debugger so that we could take a look at this together in a call?
p
I don't have a flytepropeller setup; however, the S3 bucket setup should be relatively easy to reproduce, since AWS has free storage tiers for basic accounts. I'm using the flyte-binary chart for most of my deployments; I'm still working on moving to flyte-core.