purple-father-70173
04/24/2025, 6:37 PM
Workflow[flytesnacks:development:.flytegen.my.run] failed. RuntimeExecutionError: max number of system retry attempts [11/10] exhausted. Last known status message: failed at Node[myrun]. RuntimeExecutionError: failed during plugin execution, caused by: failed to read error file @[s3://https://s3-us-east-1.amazonaws.com/my-bucket/metadata/.../mrun/data/0/error-cb1d670ac8cc4eebbf475919a72f6b47.pb]: Conf container:my-bucket != Passed Container:https:. Dynamic loading is disabled: not found
I can see the error in the Grafana logs just fine, but it's still an issue. The task node appears as Aborted, while the execution status itself remains Failed.
purple-father-70173
04/24/2025, 6:37 PM
purple-father-70173
04/25/2025, 5:52 PM
cool-lifeguard-49380
04/25/2025, 6:56 PM
The error handling for distributed PyTorch tasks here is activated, correct?
cool-lifeguard-49380
04/25/2025, 6:56 PM
purple-father-70173
04/25/2025, 6:57 PM
cool-lifeguard-49380
04/25/2025, 6:58 PM
purple-father-70173
04/25/2025, 6:58 PM
configurations.inline.plugins.k8s.enable-distributed-error-aggregation
cool-lifeguard-49380
04/25/2025, 7:00 PM
cool-lifeguard-49380
04/25/2025, 7:00 PM
cool-lifeguard-49380
04/25/2025, 7:01 PM
configurations.inline.plugins.k8s.enable-distributed-error-aggregation: when you turn this off, the error will definitely go away, but I understand you might not want to turn it off.
cool-lifeguard-49380
04/25/2025, 7:01 PM
purple-father-70173
04/25/2025, 7:11 PM
cool-lifeguard-49380
04/27/2025, 11:21 AM
cool-lifeguard-49380
04/27/2025, 11:22 AM
failed to read error file error is returned.
cool-lifeguard-49380
04/27/2025, 11:31 AM
s3://https://s3-us-east-1.amazonaws.com/my-bucket/metadata/ has a duplicate protocol in s3://https://.
This is very similar to the error I fixed for GCS in the PR mentioned above. The TL;DR of that PR was:
• When using our stow fork to make an ls request on a gs:// bucket in order to find the error files from the workers, the listed URIs didn't have the gs:// prefix but instead google://storage.googleapis.com/download/storage/v1/b/.
• When then using stow to read one of the listed error files with the faulty google://storage.googleapis.com/... prefix, stow didn't find the object.
• The fix in the PR was to make the URIs returned from the List function carry the expected gs:// prefix as well.
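The idea behind that fix can be sketched roughly as follows. This is only a minimal Python illustration, not the actual stow code (which is written in Go); the function name `to_expected_uri`, its parameters, and the object keys are hypothetical.

```python
def to_expected_uri(scheme: str, bucket: str, key: str) -> str:
    """Build the URI a later read expects from a listed object's bucket and key.

    Hypothetical helper: it ignores whatever URL the backend reported for the
    object and rebuilds the URI from the scheme the caller listed with.
    """
    return f"{scheme}://{bucket}/{key.lstrip('/')}"


# Placeholder keys standing in for the workers' error files.
listed_keys = ["metadata/error-a.pb", "/metadata/error-b.pb"]
uris = [to_expected_uri("gs", "my-bucket", key) for key in listed_keys]
print(uris)
```

The point is only that the URIs handed back from a listing should use the same scheme and bucket form that the subsequent read expects.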
I suspect that something similar is happening here: we make a list request using stow on an s3:// bucket, the response somehow contains URIs with an https:// protocol, and seemingly we end up with both.
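To illustrate why a duplicated protocol would produce the "Conf container:my-bucket != Passed Container:https:" message, here is a minimal Python sketch. It assumes the container is taken from the position right after the scheme, which is what the error text suggests; the real check lives in Go code and may differ, and `container_of` and the `error.pb` file name are made up for the example.

```python
from urllib.parse import urlsplit


def container_of(uri: str) -> str:
    """Return whatever sits where the bucket name is expected (the netloc)."""
    return urlsplit(uri).netloc


configured = "my-bucket"
good = "s3://my-bucket/metadata/error.pb"
bad = "s3://https://s3-us-east-1.amazonaws.com/my-bucket/metadata/error.pb"

for uri in (good, bad):
    passed = container_of(uri)
    status = "ok" if passed == configured else "mismatch"
    print(f"{status}: configured={configured!r} passed={passed!r}")
```

With the duplicated protocol, the part after `s3://` up to the next slash is `https:`, so the comparison against the configured bucket fails even though the bucket name appears later in the path.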
I suspect this is happening here or here in stow, where the URLs are constructed. Unfortunately, in the MinIO setup I have available I never step into those branches, because for MinIO container.customEndpoint is "http://localhost:30084".
cool-lifeguard-49380
04/27/2025, 11:32 AM
purple-father-70173
04/28/2025, 4:18 PM