purple-father-70173
04/24/2025, 6:37 PM
Workflow[flytesnacks:development:.flytegen.my.run] failed. RuntimeExecutionError: max number of system retry attempts [11/10] exhausted. Last known status message: failed at Node[myrun]. RuntimeExecutionError: failed during plugin execution, caused by: failed to read error file @[s3://https://s3-us-east-1.amazonaws.com/my-bucket/metadata/.../mrun/data/0/error-cb1d670ac8cc4eebbf475919a72f6b47.pb]: Conf container:my-bucket != Passed Container:https:. Dynamic loading is disabled: not found
I can look at the Grafana logs to see the error just fine, but it's still an issue. The task node appears as Aborted, while the execution status itself remains Failed.
purple-father-70173
04/24/2025, 6:37 PM
purple-father-70173
04/25/2025, 5:52 PM
cool-lifeguard-49380
04/25/2025, 6:56 PM
Error handling for distributed PyTorch tasks here activated, correct?
cool-lifeguard-49380
04/25/2025, 6:56 PM
purple-father-70173
04/25/2025, 6:57 PM
cool-lifeguard-49380
04/25/2025, 6:58 PM
purple-father-70173
04/25/2025, 6:58 PM
configurations.inline.plugins.k8s.enable-distributed-error-aggregation
cool-lifeguard-49380
04/25/2025, 7:00 PM
cool-lifeguard-49380
04/25/2025, 7:00 PM
cool-lifeguard-49380
04/25/2025, 7:01 PM
configurations.inline.plugins.k8s.enable-distributed-error-aggregation when you turn this off, the error will definitely go away, but I understand you might not want to turn it off.
cool-lifeguard-49380
04/25/2025, 7:01 PM
purple-father-70173
04/25/2025, 7:11 PM
cool-lifeguard-49380
04/27/2025, 11:21 AM
cool-lifeguard-49380
04/27/2025, 11:22 AM
failed to read error file error is returned.
cool-lifeguard-49380
04/27/2025, 11:31 AM
s3://https://s3-us-east-1.amazonaws.com/my-bucket/metadata/ has a duplicate protocol in s3://https://.
This is a very similar error to the one I fixed for GCS in the PR mentioned above. The TL;DR of that PR was:
• When using our stow fork to make an ls request on a gs:// bucket in order to find the error files from the workers, the listed URIs didn't have the prefix gs:// but instead google://storage.googleapis.com/download/storage/v1/b/
• When then using stow to read one of the listed error files with the faulty google://storage.googleapis.com/... prefix, stow didn't find the object.
• The fix in the PR was to make the URIs returned from the List function also have the expected gs:// prefix.
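The idea behind that fix can be sketched roughly as follows, assuming the URI of a listed object is rebuilt from the scheme the caller used, the container name, and the object key, rather than taken from the URL the storage backend reports. The function listedURI and the example key are hypothetical and only illustrate the approach; the actual change in the PR may look different.

```go
package main

import (
	"fmt"
	"strings"
)

// listedURI builds the URI of a listed object from the scheme the caller
// used (for example "gs"), the container (bucket) name, and the object key.
// Hypothetical helper for illustration; not the code from the PR.
func listedURI(scheme, container, key string) string {
	// Drop a leading slash so the result never contains "bucket//key".
	return fmt.Sprintf("%s://%s/%s", scheme, container, strings.TrimPrefix(key, "/"))
}

func main() {
	// The bucket name and key below are made up for the example.
	fmt.Println(listedURI("gs", "my-bucket", "metadata/error-0.pb"))
}
```

A URI built this way can be passed back to the same store for reading, because it has the same prefix as the URI that was used for the list request.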
I suspect that something similar is happening here: we make a list request using stow on an s3:// bucket, and the response somehow contains URIs with an https:// protocol, so we seemingly end up with both.
I suspect this is happening here or here in stow where the URLs are constructed, but unfortunately, in the MinIO setup I have available, I never step into those branches because for MinIO container.customEndpoint is "http://localhost:30084".
cool-lifeguard-49380
04/27/2025, 11:32 AM
purple-father-70173
04/28/2025, 4:18 PM