# flyte-support
p
Has anyone ever run into this issue before? I'm running a pytorch distributed job that, when launched from the pyflyte CLI, fails with:
Workflow[flytesnacks:development:.flytegen...] failed. RuntimeExecutionError: max number of system retry attempts [11/10] exhausted. Last known status message: failed at Node[...]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [pytorch]: kubeflow operator hasn't updated the pytorch custom resource since creation time 2025-03-24 14:12:47 +0000 UTC
but when I relaunch from the console, the run succeeds as normal. Is there a difference between launching from the CLI and from the console?
Relevant logs from the training operator:
time="2025-03-24T14:12:47Z" level=info msg="PyTorchJob alrlhkm89snxz9vtkrb2-fxlds1bi-0 is created."
2025-03-24T14:12:47Z    DEBUG   No ElasicPolicy or Metric is specified, skipping HPA reconciling process        {"pytorchjob": "alrlhkm89snxz9vtkrb2-fxlds1bi-0"}
time="2025-03-24T14:12:47Z" level=info msg="Reconciling for job alrlhkm89snxz9vtkrb2-fxlds1bi-0"
2025-03-24T14:12:47Z    DEBUG   events  PyTorchJob alrlhkm89snxz9vtkrb2-fxlds1bi-0 is suspended.        {"type": "Normal", "object": {"kind":"PyTorchJob","namespace":"lila","name":"alrlhkm89snxz9vtkrb2-fxlds1bi-0","uid":"0a4f3b1c-e4a3-4d6e-8745-d4b12c2cdab5","apiVersion":"kubeflow.org/v1","resourceVersion":"11187608"}, "reason": "PyTorchJobSuspended"}
2025-03-24T14:12:47Z    DEBUG   No ElasicPolicy or Metric is specified, skipping HPA reconciling process        {"pytorchjob": "alrlhkm89snxz9vtkrb2-fxlds1bi-0"}
time="2025-03-24T14:12:47Z" level=info msg="Reconciling for job alrlhkm89snxz9vtkrb2-fxlds1bi-0"
2025-03-24T14:12:47Z    DEBUG   events  PyTorchJob alrlhkm89snxz9vtkrb2-fxlds1bi-0 is suspended.        {"type": "Normal", "object": {"kind":"PyTorchJob","namespace":"lila","name":"alrlhkm89snxz9vtkrb2-fxlds1bi-0","uid":"0a4f3b1c-e4a3-4d6e-8745-d4b12c2cdab5","apiVersion":"kubeflow.org/v1","resourceVersion":"11187610"}, "reason": "PyTorchJobSuspended"}
2025-03-24T14:13:49Z    INFO    reconcile cancelled, job does not need to do reconcile or has been deleted      {"pytorchjob": {"name":"alrlhkm89snxz9vtkrb2-fxlds1bi-0","namespace":"lila"}, "sync": true, "deleted": true}
2025-03-24T14:13:49Z    INFO    PyTorchJob.kubeflow.org "alrlhkm89snxz9vtkrb2-fxlds1bi-0" not found     {"pytorchjob": {"name":"alrlhkm89snxz9vtkrb2-fxlds1bi-0","namespace":"lila"}, "unable to fetch PyTorchJob": "lila/alrlhkm89snxz9vtkrb2-fxlds1bi-0"}
The PyTorchJob is gone before the operator even gets a chance to reconcile it.
d
Not really sure, but maybe @cool-lifeguard-49380 knows?
c
No ElasicPolicy or Metric is specified, skipping HPA reconciling process
This is weird, it looks like it considers a pytorchjob without elastic policy as invalid 🤔
Do you use task_config=PyTorch or task_config=Elastic?
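For reference, a minimal sketch of the two configurations being asked about, assuming a recent flytekitplugins-kfpytorch (field names vary slightly between plugin versions):

```python
from flytekit import task
from flytekitplugins.kfpytorch import Elastic, PyTorch, Worker


# Plain PyTorchJob: no elasticPolicy is set on the custom resource.
@task(task_config=PyTorch(worker=Worker(replicas=2)))
def train_pytorch():
    ...


# Torch-elastic PyTorchJob: the operator receives an elasticPolicy.
@task(task_config=Elastic(nnodes=2, nproc_per_node=1))
def train_elastic():
    ...
```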
p
task_config=PyTorch
Like I mentioned, it works if you relaunch from the console, so I assume this is normal operation for the training-operator when no elastic policy is specified.
c
To the best of my knowledge, elastic policy is not a requirement for pytorchjobs. Which training operator version are you using?
p
v1.8.1, the latest before v2.
c
I have to admit that I haven't tested it with this operator version; I'm using 1.7.0-rc.0. I also only use Elastic tasks. Three suggestions for what you can try to debug:
• Try an older operator version.
• Try Elastic instead of the PyTorch task config, so the elastic policy the operator complained about is definitely set.
• Dump the PyTorchJob manifest when launching with pyflyte and again when relaunching from the console, then compare the two (see the sketch after this list).
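A rough way to capture the manifests for that comparison, assuming the kubernetes Python client and cluster access; the namespace and job name below are taken from the logs above and will differ per execution:

```python
# Sketch: dump the PyTorchJob spec so the pyflyte-launched and
# console-relaunched versions can be diffed. Run it promptly after
# each launch, since the CR may be deleted shortly after creation.
import yaml
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

job = api.get_namespaced_custom_object(
    group="kubeflow.org",
    version="v1",
    namespace="lila",                        # namespace from the logs above
    plural="pytorchjobs",
    name="alrlhkm89snxz9vtkrb2-fxlds1bi-0",  # job name from the logs above
)
print(yaml.safe_dump(job["spec"]))
```

Saving both dumps to files and diffing them should show whether pyflyte and the console produce different specs (e.g. a missing run policy or elastic policy).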
Please let me know if you find something. I'm also happy to help make sense of any observations. If there's something wrong with the pytorch plugin, I'd like to fix it.
p
Absolutely! Thank you for giving me some pointers.
I think I've nailed this down to a race condition based on the startup time of the job's workers. It seems we can already configure this with RunPolicy.active_deadline_seconds, but I'm wondering if you're not running into it because you can also configure rdzv_config.join_timeout in Elastic tasks.
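To make those two knobs concrete, a hedged sketch assuming flytekitplugins-kfpytorch, where the elastic rendezvous settings are passed as the rdzv_configs dict rather than a nested rdzv_config object; exact field names depend on the plugin version:

```python
from flytekit import task
from flytekitplugins.kfpytorch import Elastic, PyTorch, RunPolicy, Worker


# Non-elastic job: give the operator more headroom via the run policy.
@task(
    task_config=PyTorch(
        worker=Worker(replicas=2),
        run_policy=RunPolicy(active_deadline_seconds=1800),
    )
)
def train_pytorch():
    ...


# Elastic job: the rendezvous join timeout covers slow-starting workers.
@task(
    task_config=Elastic(
        nnodes=2,
        nproc_per_node=1,
        rdzv_configs={"join_timeout": 900, "timeout": 900},
    )
)
def train_elastic():
    ...
```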