For distributed pytorch (or tf, …) tasks, the retu...
# ask-the-community
f
For distributed PyTorch (or TF, …) tasks, which worker's return value is passed along to subsequent tasks? Is this random, i.e. a race condition? When creating a PyTorch task, the args of both pods specify the same values for:
    - --output-prefix
    - gs://.../metadata/propeller/sandbox-development-f6695ca08aa47490c859/n0/data/0
    - --raw-output-data-prefix
    - gs://.../xq/f6695ca08aa47490c859-n0-0

    - --output-prefix
    - gs://...metadata/propeller/sandbox-development-f6695ca08aa47490c859/n0/data/0
    - --raw-output-data-prefix
    - gs://.../xq/f6695ca08aa47490c859-n0-0
If both return the same value (as is presumably often the case), it shouldn't matter if both write. But if I want to return a metric that might be slightly different for each worker, is it random which one I get?
k
No, you can control this. In all other worker processes you can raise an IgnoreOutputs exception and let rank 0 return the output.
f
I'll just add an
if int(os.environ.get("RANK", "0")) != 0: raise IgnoreOutputs()
then? (RANK is a string in the environment, so it needs the int() before comparing with 0.) But otherwise it is random? (Just to make sure I fully understand)
Thanks Ketan 🙂
k
this is a docs issue
f
I can document it, where should it go?
k
Where would you put it, from a user's point of view?
I think this is in mpioperator?
But it should be somewhere more discoverable.
But other users might come from the tf or mpi page
So I think all of these pages should link to one place where it is documented.
# Control which rank returns its value
In distributed training, the return values from different workers might differ. If you want to control which worker passes its return value to subsequent tasks in the workflow, you can raise an IgnoreOutputs exception on all other ranks.
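A minimal sketch of what that could look like in task code. The helper names `current_rank` and `return_only_from_rank_zero` are made up for illustration, and the `IgnoreOutputs` class here is a local stand-in so the snippet runs without flytekit. In a real task you would import flytekit's own `IgnoreOutputs` (I believe it lives in `flytekit.core.base_task`, but check your version). It also assumes the launcher sets the `RANK` environment variable for each worker.

```python
import os


class IgnoreOutputs(Exception):
    """Local stand-in for flytekit's IgnoreOutputs (assumption: swap in the real import)."""


def current_rank(env=None) -> int:
    """Read the worker rank from the RANK environment variable as an int."""
    env = os.environ if env is None else env
    # Environment variables are strings, so convert before comparing with 0.
    return int(env.get("RANK", "0"))


def return_only_from_rank_zero(value, env=None):
    """Return `value` on rank 0; raise IgnoreOutputs on every other rank."""
    rank = current_rank(env)
    if rank != 0:
        raise IgnoreOutputs(f"rank {rank} does not write outputs")
    return value
```

At the end of the task body you would then write something like `return return_only_from_rank_zero(metric)`, so only rank 0's metric reaches downstream tasks.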
If this had been written in the pytorch plugin docs page, I would have immediately known what I need to do. We could include this sentence here, here, and here (even though it gives me some sadness that it would be replicated).
Wdyt? Shall I add it?
k
sure
cc @Niels Bantilan
f