# flyte-support
p
Has anyone ever run into this issue before? I'm running a pytorch distributed job that, when launched from the pyflyte CLI, fails with:
Workflow[flytesnacks:development:.flytegen...] failed. RuntimeExecutionError: max number of system retry attempts [11/10] exhausted. Last known status message: failed at Node[...]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [pytorch]: kubeflow operator hasn't updated the pytorch custom resource since creation time 2025-03-24 14:12:47 +0000 UTC
but when I relaunch from the console, the run succeeds as normal. Is there a difference between launching from the CLI and from the console?
Relevant logs from the training operator:
time="2025-03-24T14:12:47Z" level=info msg="PyTorchJob alrlhkm89snxz9vtkrb2-fxlds1bi-0 is created."
2025-03-24T14:12:47Z    DEBUG   No ElasicPolicy or Metric is specified, skipping HPA reconciling process        {"pytorchjob": "alrlhkm89snxz9vtkrb2-fxlds1bi-0"}
time="2025-03-24T14:12:47Z" level=info msg="Reconciling for job alrlhkm89snxz9vtkrb2-fxlds1bi-0"
2025-03-24T14:12:47Z    DEBUG   events  PyTorchJob alrlhkm89snxz9vtkrb2-fxlds1bi-0 is suspended.        {"type": "Normal", "object": {"kind":"PyTorchJob","namespace":"lila","name":"alrlhkm89snxz9vtkrb2-fxlds1bi-0","uid":"0a4f3b1c-e4a3-4d6e-8745-d4b12c2cdab5","apiVersion":"kubeflow.org/v1","resourceVersion":"11187608"}, "reason": "PyTorchJobSuspended"}
2025-03-24T14:12:47Z    DEBUG   No ElasicPolicy or Metric is specified, skipping HPA reconciling process        {"pytorchjob": "alrlhkm89snxz9vtkrb2-fxlds1bi-0"}
time="2025-03-24T14:12:47Z" level=info msg="Reconciling for job alrlhkm89snxz9vtkrb2-fxlds1bi-0"
2025-03-24T14:12:47Z    DEBUG   events  PyTorchJob alrlhkm89snxz9vtkrb2-fxlds1bi-0 is suspended.        {"type": "Normal", "object": {"kind":"PyTorchJob","namespace":"lila","name":"alrlhkm89snxz9vtkrb2-fxlds1bi-0","uid":"0a4f3b1c-e4a3-4d6e-8745-d4b12c2cdab5","apiVersion":"kubeflow.org/v1","resourceVersion":"11187610"}, "reason": "PyTorchJobSuspended"}
2025-03-24T14:13:49Z    INFO    reconcile cancelled, job does not need to do reconcile or has been deleted      {"pytorchjob": {"name":"alrlhkm89snxz9vtkrb2-fxlds1bi-0","namespace":"lila"}, "sync": true, "deleted": true}
2025-03-24T14:13:49Z    INFO    PyTorchJob.kubeflow.org "alrlhkm89snxz9vtkrb2-fxlds1bi-0" not found     {"pytorchjob": {"name":"alrlhkm89snxz9vtkrb2-fxlds1bi-0","namespace":"lila"}, "unable to fetch PyTorchJob": "lila/alrlhkm89snxz9vtkrb2-fxlds1bi-0"}
The PyTorchJob is gone before the operator even gets a chance to reconcile it.
d
Not really sure, but maybe @cool-lifeguard-49380 knows?
c
No ElasicPolicy or Metric is specified, skipping HPA reconciling process
This is weird, it looks like it considers a pytorchjob without elastic policy as invalid 🤔
Do you use task_config=PyTorch or task_config=Elastic?
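For reference, a minimal sketch of the two configurations being asked about, assuming a recent flytekitplugins-kfpytorch (field names vary slightly between plugin versions):

```python
from flytekit import task
from flytekitplugins.kfpytorch import Elastic, PyTorch, Worker


# Plain PyTorchJob: no elasticPolicy is set on the custom resource.
@task(task_config=PyTorch(worker=Worker(replicas=2)))
def train_pytorch():
    ...


# Torch-elastic PyTorchJob: the operator receives an elasticPolicy.
@task(task_config=Elastic(nnodes=2, nproc_per_node=1))
def train_elastic():
    ...
```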
p
task_config=PyTorch
Like I mentioned, it works if you relaunch from the console, so I assume this is normal operation for the training-operator when no elastic policy is specified.
c
To the best of my knowledge, elastic policy is not a requirement for pytorchjobs. Which training operator version are you using?
p
v1.8.1, the latest before v2.
c
I have to admit that I haven't tested it with this operator version; I'm using 1.7.0-rc.0. I also only use Elastic tasks. Three suggestions for what you can try to debug:
• Try an older operator version.
• Try Elastic instead of the PyTorch task config, so the elastic policy the operator complained about is definitely set.
• Dump the PyTorchJob manifest when launching with pyflyte and again when relaunching from the console, then compare the two (see the sketch after this list).
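A rough way to capture the manifests for that comparison, assuming the kubernetes Python client and cluster access; the namespace and job name below are taken from the logs above and will differ per execution:

```python
# Sketch: dump the PyTorchJob spec so the pyflyte-launched and
# console-relaunched versions can be diffed. Run it promptly after
# each launch, since the CR may be deleted shortly after creation.
import yaml
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

job = api.get_namespaced_custom_object(
    group="kubeflow.org",
    version="v1",
    namespace="lila",                        # namespace from the logs above
    plural="pytorchjobs",
    name="alrlhkm89snxz9vtkrb2-fxlds1bi-0",  # job name from the logs above
)
print(yaml.safe_dump(job["spec"]))
```

Saving both dumps to files and diffing them should show whether pyflyte and the console produce different specs (e.g. a missing run policy or elastic policy).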
Please let me know if you find something. I'm also happy to help make sense of any observations. If there's something wrong with the pytorch plugin, I'd like to fix it.
p
Absolutely! Thank you for giving me some pointers.
I think I've nailed this down to a race condition based on the startup time of the job's workers. It seems we can already configure this with RunPolicy.active_deadline_seconds, but I'm wondering if you're not running into it because you can also configure rdzv_config.join_timeout in Elastic tasks.
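To make those two knobs concrete, a hedged sketch assuming flytekitplugins-kfpytorch, where the elastic rendezvous settings are passed as the rdzv_configs dict rather than a nested rdzv_config object; exact field names depend on the plugin version:

```python
from flytekit import task
from flytekitplugins.kfpytorch import Elastic, PyTorch, RunPolicy, Worker


# Non-elastic job: give the operator more headroom via the run policy.
@task(
    task_config=PyTorch(
        worker=Worker(replicas=2),
        run_policy=RunPolicy(active_deadline_seconds=1800),
    )
)
def train_pytorch():
    ...


# Elastic job: the rendezvous join timeout covers slow-starting workers.
@task(
    task_config=Elastic(
        nnodes=2,
        nproc_per_node=1,
        rdzv_configs={"join_timeout": 900, "timeout": 900},
    )
)
def train_elastic():
    ...
```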