# flyte-support
Hello. When running a training job using the Kubeflow training operator, I couldn't find a way to make sure worker-0 is always the one assigned global rank 0. It then becomes a guessing game to figure out which pod has global rank 0 (usually the one with more logging in most training frameworks). Has anyone had a similar issue?
Are you using elastic?
Yes
Why do you need to know which one is 0? Is this to debug using logs?
Yes
It does seem odd, as the ranks are preset. Hmm, let me look into the training operator.
Any news @freezing-airport-6809? I think I am going to log the hostname of the rank 0 pod to our tracking dashboard, but it would have been really nice if worker 0 were always rank 0.
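A minimal sketch of that workaround, assuming `torch.distributed` has already been initialized by torchrun/elastic inside the training script; the dashboard call is a hypothetical placeholder for whatever tracking client is in use:

```python
# Minimal sketch: after torch.distributed is initialized (e.g. by torchrun),
# have the rank-0 process report its own hostname so the right pod's logs
# can be found later. `log_to_dashboard` is a hypothetical placeholder for
# your tracking client (MLflow, W&B, etc.).
import socket

import torch.distributed as dist


def report_rank0_host() -> None:
    if dist.is_available() and dist.is_initialized() and dist.get_rank() == 0:
        hostname = socket.gethostname()  # the pod name inside Kubernetes
        print(f"[rank 0] running on pod {hostname}", flush=True)
        # log_to_dashboard({"rank0_pod": hostname})  # hypothetical tracking call
```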
@fierce-oil-47448 did you look into the CRD status?
Does it show the currently assigned rank 0?
If so, we can easily plumb it through from Flyteplugin automatically.
@freezing-airport-6809 If you mean the PyTorchJob CRD status, nothing is there. However, if I look at the pods themselves, I see that the rank 0 pod has the PET_RDZV_ENDPOINT variable set to its own pod name:<port>
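Based on that observation, here is a hedged sketch for checking from inside a pod whether its PET_RDZV_ENDPOINT points at itself. This is a heuristic drawn from the behaviour described above, not a documented guarantee from the training operator or torchelastic:

```python
# Sketch based on the observation above: if the pod's PET_RDZV_ENDPOINT
# (set for torchelastic) points at the pod's own hostname, that pod hosts
# the rendezvous endpoint, which in this setup appeared to coincide with
# global rank 0. Treat this as a heuristic, not a guarantee.
import os
import socket


def hosts_rendezvous_endpoint() -> bool:
    endpoint = os.environ.get("PET_RDZV_ENDPOINT", "")  # e.g. "my-job-worker-3:29400"
    host = endpoint.split(":", 1)[0] if endpoint else ""
    return host != "" and host == socket.gethostname()


if __name__ == "__main__":
    print("this pod hosts the rendezvous endpoint:", hosts_rendezvous_endpoint())
```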