# flyte-support
Hello. When running a training job using the Kubeflow training operator, I couldn't find a way to make sure worker-0 is always the one assigned global rank 0. It then becomes a guessing game to figure out which pod has global rank 0 (usually the one with more logging in most training frameworks). Has anyone had a similar issue?
Are you using elastic?
Yes
Why do you need to know which one is 0? Is this to debug using logs?
Yes
It does seem odd, as the ranks are preset. Hmm, let me look into the training operator.
Any news @freezing-airport-6809? I think I am going to log the hostname of the rank 0 pod to our tracking dashboard, but it would have been really nice if worker 0 were always rank 0.
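A minimal sketch of that workaround, assuming `torch.distributed` has already been initialized by torchrun/elastic inside the training script; the dashboard call is a hypothetical placeholder for whatever tracking client is in use:

```python
# Minimal sketch: after torch.distributed is initialized (e.g. by torchrun),
# have the rank-0 process report its own hostname so the right pod's logs
# can be found later. `log_to_dashboard` is a hypothetical placeholder for
# your tracking client (MLflow, W&B, etc.).
import socket

import torch.distributed as dist


def report_rank0_host() -> None:
    if dist.is_available() and dist.is_initialized() and dist.get_rank() == 0:
        hostname = socket.gethostname()  # the pod name inside Kubernetes
        print(f"[rank 0] running on pod {hostname}", flush=True)
        # log_to_dashboard({"rank0_pod": hostname})  # hypothetical tracking call
```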
@fierce-oil-47448 did you look into the CRD status?
Does it show the currently assigned rank 0?
If so, we can easily plumb it through from Flyteplugin automatically.
@freezing-airport-6809 If you mean the PyTorchJob CRD status, nothing is there. However, if I look at the pods themselves, I see that the rank 0 pod has the PET_RDZV_ENDPOINT variable set to its own pod name:<port>
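Based on that observation, here is a hedged sketch for checking from inside a pod whether its PET_RDZV_ENDPOINT points at itself. This is a heuristic drawn from the behaviour described above, not a documented guarantee from the training operator or torchelastic:

```python
# Sketch based on the observation above: if the pod's PET_RDZV_ENDPOINT
# (set for torchelastic) points at the pod's own hostname, that pod hosts
# the rendezvous endpoint, which in this setup appeared to coincide with
# global rank 0. Treat this as a heuristic, not a guarantee.
import os
import socket


def hosts_rendezvous_endpoint() -> bool:
    endpoint = os.environ.get("PET_RDZV_ENDPOINT", "")  # e.g. "my-job-worker-3:29400"
    host = endpoint.split(":", 1)[0] if endpoint else ""
    return host != "" and host == socket.gethostname()


if __name__ == "__main__":
    print("this pod hosts the rendezvous endpoint:", hosts_rendezvous_endpoint())
```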