# flyte-support
p
I'm running into an issue with the PyTorch plugin. When a job fails instead of printing out the error, I get something like this instead:
Workflow[flytesnacks:development:.flytegen.my.run] failed. RuntimeExecutionError: max number of system retry attempts [11/10] exhausted. Last known status message: failed at Node[myrun]. RuntimeExecutionError: failed during plugin execution, caused by: failed to read error file @[s3://https://s3-us-east-1.amazonaws.com/my-bucket/metadata/.../mrun/data/0/error-cb1d670ac8cc4eebbf475919a72f6b47.pb]: Conf container:my-bucket != Passed Container:https:. Dynamic loading is disabled: not found
I can look at the Grafana logs to see the error just fine, but it's still an issue. The task node appears as `Aborted`, while the execution status itself remains `Failed`.
This only happens with the PyTorch plugin; error handling works fine with the Ray plugin and with regular tasks.
@cool-lifeguard-49380 have you seen something like this before?
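In case it's useful, here's a quick sketch of what Go's `net/url` makes of that reference (just an illustration with a shortened path, and I'm assuming the storage layer splits the reference in roughly this way), which would explain the `Conf container:my-bucket != Passed Container:https:` part:

```go
package main

import (
	"fmt"
	"net/url"
)

func main() {
	// The malformed reference from the error message (path shortened for illustration).
	ref := "s3://https://s3-us-east-1.amazonaws.com/my-bucket/metadata/error.pb"

	u, err := url.Parse(ref)
	if err != nil {
		panic(err)
	}

	// The authority between "s3://" and the next "/" is the literal string "https:",
	// so anything that treats the host as the bucket sees "https:" instead of "my-bucket".
	fmt.Println(u.Scheme) // s3
	fmt.Println(u.Host)   // https:
	fmt.Println(u.Path)   // //s3-us-east-1.amazonaws.com/my-bucket/metadata/error.pb
}
```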
c
Just to confirm, you have the option "Error handling for distributed PyTorch tasks" here activated, correct?
What helm chart version do you use?
p
I'm using flyte-binary v1.15.1, and yes I have that value activated
c
It very much reminds me of this fix I had to do in our stow fork for gcs 🤔
p
`configurations.inline.plugins.k8s.enable-distributed-error-aggregation`
c
I have a 4-hour train ride on Sunday; I’ll try to reproduce this.
I unfortunately don’t have access to a Flyte deployment on S3, but hopefully the same happens with MinIO.
When you turn `configurations.inline.plugins.k8s.enable-distributed-error-aggregation` off, the error will definitely go away, but I understand you might not want to turn it off.
I’ll get back to you on Sunday.
p
Sounds good, thanks for the help Fabio!
c
Hey @purple-father-70173, I tried reproducing the issue with my local dev setup using MinIO (as I said, I unfortunately don't have access to an AWS deployment). In this setup I can't reproduce the issue.
This is the code location where your `failed to read error file` error is returned.
In your error message I noticed that the bucket uri `s3://https://s3-us-east-1.amazonaws.com/my-bucket/metadata/` has a duplicate protocol in `s3://https://`. This is very similar to the error I fixed for gcs in the PR mentioned above. The TL;DR of that PR was:
• When using our stow fork to make an `ls` request on a `gs://` bucket in order to find the error files from the workers, the listed uris didn't have the prefix `gs://` but instead `google://storage.googleapis.com/download/storage/v1/b/`.
• When then using stow to read one of the listed error files with the faulty `google://storage.googleapis.com/...` prefix, stow didn't find the object.
• The fix in the PR was that the uris returned from the List function also carry the expected `gs://` prefix.
I suspect that something similar is happening here: we make a list request using stow on an `s3://` bucket, the response somehow contains uris with an `https://` protocol, and seemingly we end up with both. I suspect this is happening here or here in stow, where the urls are constructed, but unfortunately in the setup with MinIO I have available I never step into those branches, because for MinIO `container.customEndpoint` is `"http://localhost:30084"`.
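To make the suspicion concrete, the kind of normalization I mean would look roughly like this. This is just a sketch with a made-up helper name and a shortened key, not the actual code in our stow fork; the gcs PR did the equivalent for `gs://`:

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// normalizeS3URI is a hypothetical helper: it rewrites a path-style
// https://<endpoint>/<bucket>/<key> url, as a List call might return it,
// back into the s3://<bucket>/<key> form the error aggregation expects.
// The real fix would live where the urls are constructed in stow.
func normalizeS3URI(raw, bucket string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", err
	}
	if u.Scheme == "s3" {
		// Already in the expected form, nothing to do.
		return raw, nil
	}
	key := strings.TrimPrefix(u.Path, "/")
	key = strings.TrimPrefix(key, bucket+"/")
	return fmt.Sprintf("s3://%s/%s", bucket, key), nil
}

func main() {
	listed := "https://s3-us-east-1.amazonaws.com/my-bucket/metadata/data/0/error.pb"
	fixed, err := normalizeS3URI(listed, "my-bucket")
	if err != nil {
		panic(err)
	}
	fmt.Println(fixed) // s3://my-bucket/metadata/data/0/error.pb
}
```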
Do you happen to have a setup where you can run flytepropeller locally with a debugger so that we could take a look at this together in a call?
p
I don't have a flytepropeller setup; however, the S3 bucket setup should be relatively easy to reproduce, since AWS has free storage tiers for basic accounts. I'm using the flyte-binary chart for most of my deployments; I'm still working on moving to flyte-core.