# ask-the-community
b
Hello. When running a training job using the kubeflow training operator, I couldn't find a way to make sure worker-0 is always the one that gets global rank 0. It then becomes a guessing game to figure out which pod has global rank 0 (the one that usually produces more logging in most training frameworks). Has anyone had a similar issue?
k
Are you using elastic?
b
Yes
k
Why do you need to know which one is 0? Is this to debug using logs?
b
Yes
k
It does seem odd, as the ranks are preset. Hmm, let me look into the training operator
b
Any news @Ketan (kumare3)? I think I am going to log the hostname of the rank-0 worker to our tracking dashboard, but it would have been really nice if worker-0 were always rank 0.
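Something like this, as a rough sketch (it assumes the training script has already called `dist.init_process_group()`, and `tracker` is just a placeholder for whatever dashboard client is in use):

```python
import socket

import torch.distributed as dist


def log_rank0_hostname(tracker):
    # Assumes dist.init_process_group() has already been called
    # (the elastic agent only sets the env vars; the script initializes).
    if dist.get_rank() == 0:
        # Inside a PyTorchJob the hostname is the pod name, so logging it
        # records directly which pod ended up with global rank 0.
        # `tracker` is a hypothetical tracking/dashboard client.
        tracker.log("rank0_host", socket.gethostname())
```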
k
@Buğra Gedik did you look into the CRD status
does it show the currently assigned rank-0?
if so, we can easily plumb it from Flyteplugin automatically
b
@Ketan (kumare3) If you are asking about the PyTorchJob CRD status, nothing is there. However, if I look at the pods themselves, I see that the one with rank 0 has the PET_RDZV_ENDPOINT variable set to its own pod name:<port>
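In case anyone wants to check the same thing, here is a rough sketch using the Kubernetes Python client that lists PET_RDZV_ENDPOINT per pod; the `training.kubeflow.org/job-name` label selector is an assumption about how the training operator labels its pods:

```python
from kubernetes import client, config


def pet_rdzv_endpoints(namespace: str, job_name: str) -> dict:
    """Map each pod of a PyTorchJob to its PET_RDZV_ENDPOINT env value."""
    config.load_kube_config()  # or config.load_incluster_config() when running in-cluster
    v1 = client.CoreV1Api()
    # Assumed label selector; adjust to match how your training operator labels pods.
    pods = v1.list_namespaced_pod(
        namespace, label_selector=f"training.kubeflow.org/job-name={job_name}"
    )
    endpoints = {}
    for pod in pods.items:
        for container in pod.spec.containers:
            for env in container.env or []:
                if env.name == "PET_RDZV_ENDPOINT":
                    endpoints[pod.metadata.name] = env.value
    return endpoints
```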