purple-father-70173
03/24/2025, 2:39 PM
When launching with the pyflyte CLI it fails with:
Workflow[flytesnacks:development:.flytegen...] failed. RuntimeExecutionError: max number of system retry attempts [11/10] exhausted. Last known status message: failed at Node[...]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [pytorch]: kubeflow operator hasn't updated the pytorch custom resource since creation time 2025-03-24 14:12:47 +0000 UTC
But when I relaunch from the console the run succeeds as normal. Is there a difference between launching from the CLI and from the console?

purple-father-70173
03/24/2025, 2:46 PM
time="2025-03-24T14:12:47Z" level=info msg="PyTorchJob alrlhkm89snxz9vtkrb2-fxlds1bi-0 is created."
2025-03-24T14:12:47Z DEBUG No ElasicPolicy or Metric is specified, skipping HPA reconciling process {"pytorchjob": "alrlhkm89snxz9vtkrb2-fxlds1bi-0"}
time="2025-03-24T14:12:47Z" level=info msg="Reconciling for job alrlhkm89snxz9vtkrb2-fxlds1bi-0"
2025-03-24T14:12:47Z DEBUG events PyTorchJob alrlhkm89snxz9vtkrb2-fxlds1bi-0 is suspended. {"type": "Normal", "object": {"kind":"PyTorchJob","namespace":"lila","name":"alrlhkm89snxz9vtkrb2-fxlds1bi-0","uid":"0a4f3b1c-e4a3-4d6e-8745-d4b12c2cdab5","apiVersion":"kubeflow.org/v1","resourceVersion":"11187608"}, "reason": "PyTorchJobSuspended"}
2025-03-24T14:12:47Z DEBUG No ElasicPolicy or Metric is specified, skipping HPA reconciling process {"pytorchjob": "alrlhkm89snxz9vtkrb2-fxlds1bi-0"}
time="2025-03-24T14:12:47Z" level=info msg="Reconciling for job alrlhkm89snxz9vtkrb2-fxlds1bi-0"
2025-03-24T14:12:47Z DEBUG events PyTorchJob alrlhkm89snxz9vtkrb2-fxlds1bi-0 is suspended. {"type": "Normal", "object": {"kind":"PyTorchJob","namespace":"lila","name":"alrlhkm89snxz9vtkrb2-fxlds1bi-0","uid":"0a4f3b1c-e4a3-4d6e-8745-d4b12c2cdab5","apiVersion":"kubeflow.org/v1","resourceVersion":"11187610"}, "reason": "PyTorchJobSuspended"}
2025-03-24T14:13:49Z INFO reconcile cancelled, job does not need to do reconcile or has been deleted {"pytorchjob": {"name":"alrlhkm89snxz9vtkrb2-fxlds1bi-0","namespace":"lila"}, "sync": true, "deleted": true}
2025-03-24T14:13:49Z INFO PyTorchJob.kubeflow.org "alrlhkm89snxz9vtkrb2-fxlds1bi-0" not found {"pytorchjob": {"name":"alrlhkm89snxz9vtkrb2-fxlds1bi-0","namespace":"lila"}, "unable to fetch PyTorchJob": "lila/alrlhkm89snxz9vtkrb2-fxlds1bi-0"}
It's gone before the operator even gets a chance to reconcile.

damp-lion-88352
03/24/2025, 3:13 PM

cool-lifeguard-49380
03/24/2025, 3:15 PM
"No ElasicPolicy or Metric is specified, skipping HPA reconciling process"
This is weird, it looks like it considers a pytorchjob without an elastic policy as invalid 🤔
cool-lifeguard-49380
03/24/2025, 3:15 PM
task_config=PyTorch or task_config=Elastic?

purple-father-70173
03/24/2025, 3:15 PM
task_config=Pytorch
purple-father-70173
03/24/2025, 3:16 PM

cool-lifeguard-49380
03/24/2025, 3:23 PM

purple-father-70173
03/24/2025, 3:24 PM

purple-father-70173
03/24/2025, 3:25 PM

cool-lifeguard-49380
03/24/2025, 3:43 PM
1.7.0-rc.0. I also only use Elastic tasks.
Three suggestions for what you can try in order to debug this:
• Try an older operator version.
• Try the Elastic task config instead of Pytorch; then the elastic policy the operator complained about is definitely set (see the sketch below).
• Dump the pytorchjob manifest when launching with pyflyte and the one when relaunched from the console, and compare the two.
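For the second suggestion, a minimal sketch of the two task configs side by side, assuming flytekitplugins-kfpytorch; the constructor arguments (Worker(replicas=...), nnodes, nproc_per_node) are illustrative and may differ between plugin versions:
```python
# Minimal sketch, not the exact workflow from this thread.
from flytekit import task
from flytekitplugins.kfpytorch import Elastic, PyTorch, Worker


# Plain (non-elastic) config: the operator gets a PyTorchJob without an elastic policy.
@task(task_config=PyTorch(worker=Worker(replicas=2)))
def train_pytorch() -> None:
    ...


# Elastic config: launched torchrun-style, so the elastic policy the operator
# checks for is set on the PyTorchJob.
@task(task_config=Elastic(nnodes=2, nproc_per_node=4))
def train_elastic() -> None:
    ...
```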
cool-lifeguard-49380
03/24/2025, 4:33 PM

purple-father-70173
03/24/2025, 4:38 PM

purple-father-70173
03/26/2025, 7:49 PM
RunPolicy.active_deadline_seconds, but I'm wondering if you're not running into this because you can also configure rdzv_config.join_timeout in Elastic tasks.
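For anyone reading along, a rough sketch of where those two settings live, assuming flytekitplugins-kfpytorch; the field names (run_policy, rdzv_configs, join_timeout) and values are taken from recent plugin versions and may differ in yours:
```python
# Minimal sketch, assuming flytekitplugins-kfpytorch; field names may vary by version.
from flytekit import task
from flytekitplugins.kfpytorch import Elastic, PyTorch, RunPolicy, Worker

# Cap how long the PyTorchJob may stay active before the operator fails it.
deadline = RunPolicy(active_deadline_seconds=3600)


@task(task_config=PyTorch(worker=Worker(replicas=2), run_policy=deadline))
def train_with_deadline() -> None:
    ...


# Elastic tasks also expose rendezvous settings; join_timeout bounds how long
# workers wait for each other to join at startup.
@task(task_config=Elastic(nnodes=2, nproc_per_node=4, rdzv_configs={"join_timeout": 900}))
def train_elastic_with_join_timeout() -> None:
    ...
```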