fierce-oil-47448
07/22/2024, 10:30 PMfreezing-airport-6809
fierce-oil-47448
07/22/2024, 11:37 PMfierce-oil-47448
07/22/2024, 11:38 PMfierce-oil-47448
07/22/2024, 11:38 PMfreezing-airport-6809
freezing-airport-6809
fierce-oil-47448
07/22/2024, 11:48 PMfreezing-airport-6809
fierce-oil-47448
07/23/2024, 5:06 AMfreezing-airport-6809
freezing-airport-6809
freezing-airport-6809
fierce-oil-47448
07/23/2024, 5:26 AMtry:
out = elastic_launch(
config=config,
entrypoint=launcher_target_func,
)(*launcher_args)
except ChildFailedError as e:
_, first_failure = e.get_first_failure()
if is_recoverable_worker_error(first_failure):
raise FlyteRecoverableException(e.format_msg())
else:
raise RuntimeError(e.format_msg())
except SignalException as e:
logger.exception(f"Elastic launch agent process terminating: {e}")
raise IgnoreOutputs()
It assumes that elastic launch either throws ChildFailedError or SignalException.
Here is what happens:
• We have 2 node training (0-7 ranks on one node, 8-15 ranks on another node)fierce-oil-47448
07/23/2024, 5:26 AMfierce-oil-47448
07/23/2024, 5:27 AMfierce-oil-47448
07/23/2024, 5:29 AMfreezing-airport-6809
freezing-airport-6809
freezing-airport-6809
fierce-oil-47448
07/23/2024, 6:38 AMcool-lifeguard-49380
07/23/2024, 6:59 AMerror-0.pb
, error-1.pb
, ... and the backend plugin would need to go through them and pick the one with the earliest time stamp. It would be nice if there was a general solution for plugins to be able to have multiple error files as torch is not the only plugin with this conceptional problem.
Would you be willing to colaborate on this with me @fierce-oil-47448? @freezing-airport-6809 maybe we can discuss in the next contrib sync how exactly this should be done.fierce-oil-47448
07/23/2024, 7:05 AMcool-lifeguard-49380
07/23/2024, 7:31 AMcool-lifeguard-49380
07/23/2024, 7:35 AMget_error_file_name
which by default returned None but which the elastic task would implement to return a file name based on the group rank env var, would we be able to call that function and access the error file name in in pod entrypoint script? That would already solve the uploading + not overwriting part. We’d still need logic in the backend to read multiple files of course …fierce-oil-47448
07/23/2024, 7:49 AMfierce-oil-47448
07/25/2024, 4:50 PMmetadata.labels."<http://training.kubeflow.org/replica-index|training.kubeflow.org/replica-index>"
:
env:
- name: FLYTE_PYTORCH_TASK_REPLICA_INDEX
valueFrom:
fieldRef:
fieldPath: metadata.labels['training.kubeflow.org/replica-index']
Any suggestions on how to do that?fierce-oil-47448
07/25/2024, 4:52 PMcool-lifeguard-49380
07/25/2024, 5:47 PMcool-lifeguard-49380
07/25/2024, 5:49 PMcool-lifeguard-49380
07/25/2024, 5:49 PMcool-lifeguard-49380
07/25/2024, 5:51 PMChildError
in the elastic task into this error proto message but maybe we can use a proxy timestamp measured in the entrypoint? Or we introduce a new exception type in flytekit which has a timestamp attribute which we can set and raise in the elastic task? And the entrypoint would then check whether this type of exception was raised and would fill the value in the error proto. (Reading this again, this is probably better than measuring a second timestamp in the entrypoint, don't want to create another race condition.)cool-lifeguard-49380
07/25/2024, 5:52 PMtCtx.OutputWriter()
which is passed to outPaths
of NewRemoteFileOutputReader
contains a reference to the bucket uri which contains the `error.pb`:
<protocol>://<bucket name>/metadata/propeller/<project>-<domain>-<execution id>/n0/data/0
Notably, the outPaths
does not directly point to the error.pb
but to the “directory” containing it.cool-lifeguard-49380
07/25/2024, 5:52 PM--output-prefix
to pyflyte-execute
.cool-lifeguard-49380
07/25/2024, 5:54 PMRemoteFileOutputReader
implements the `OutputReader` interface.cool-lifeguard-49380
07/25/2024, 5:57 PMMultiErrorFileRemoteFileOutputReader
(maybe "earliest timestamp" should make it into the name). This would not try to read the <protocol>://<bucket name>/metadata/propeller/<project>-<domain>-<execution id>/n0/data/0/error.pb
but would search the <protocol>://<bucket name>/metadata/propeller/<project>-<domain>-<execution id>/n0/data/0
output prefix for files called e.g. error-*.pb
. It would then figure out which one has the earliest timestamp.cool-lifeguard-49380
07/25/2024, 5:57 PMcool-lifeguard-49380
07/25/2024, 5:58 PMMultipleErrorFiles
to PluginProperties
, see here. (Again, "earliest time stamp" should probably make it into that name ^^).cool-lifeguard-49380
07/25/2024, 6:00 PMNewRemoteFileOutputReader
, we do have access to e.plugin
, and thus to PluginProperties
and could make use of that information to instantiate another output reader.cool-lifeguard-49380
07/25/2024, 6:01 PMcool-lifeguard-49380
07/25/2024, 6:02 PMcool-lifeguard-49380
07/25/2024, 6:11 PMfierce-oil-47448
07/25/2024, 7:07 PMContainerError
(here) and not in Error
, as the former is what entry point uses to capture the error.
Everything you described makes a lot of sense. These changes look reasonably straightforward to me. I can help with the implementation, but I'd be slow (due to other priorities at work). Also, I'll need help with a set up where I can try backend changes (never done that).cool-lifeguard-49380
07/25/2024, 7:45 PMAlso, I'll need help with a set up where I can try backend changes (never done that).Happy to help, in which time zone are you?
cool-lifeguard-49380
07/25/2024, 7:45 PMcool-lifeguard-49380
07/25/2024, 7:45 PMfierce-oil-47448
07/25/2024, 7:59 PMcool-lifeguard-49380
07/26/2024, 10:44 AMfreezing-airport-6809
freezing-airport-6809
freezing-airport-6809
high-accountant-32689
07/26/2024, 2:31 PMcool-lifeguard-49380
07/26/2024, 3:28 PMThe thing is many folks are out this week, they will be back next weekNo time pressure
high-accountant-32689
07/26/2024, 9:07 PMerror.pb
protobuf message is an ErrorDocument
b. where we add the timestamp? in ContainerError
or ErrorDocument
c. Do we have to worry about time drift? (different containers having different clocks)
2. Custom Error File Naming: Modify the Flytekit pod entrypoint to generate multiple error files (e.g., error-0.pb, error-1.pb) based on the group rank. This avoids overwriting errors from different pods.
a. in https://github.com/flyteorg/flytekit/pull/2607/ we're generating random suffixes instead
3. Proposed backend Changes:
a. Implement a new output reader (e.g., MultiErrorFileRemoteFileOutputReader) in Flyte's plugin manager to read these multiple error files. This reader will search for files matching the pattern (e.g., error-*.pb) and select the one with the earliest timestamp.
i. Flyte turns ErrorDocument
into an ExecutionError
here and this is what ends up being reported downstream. We can leverage that in the implementation of ReadError
in this new output reader. Not sure if we have to turn this into a policy (i.e. does it always make sense to pick the first error sorted by timestamp or should that be configurable?)
b. Task Pod Modifications: Ensure the task pods write unique error files based on the group rank or replica index.
i. If we want to encode the rank in the error message, the PyTorch operator exposes the rank as an environment variable (see https://github.com/kubeflow/training-operator/blob/master/pkg/controller.v1/pytorch/envvar.go#L89-L96). * This PET_*
business seems to be a pytorch quirk introduced some time ago (I couldn't find what PET
stands for)
c. Additional updates to the Training Operator might be needed to ensure the replica index is available in the environment variables.cool-lifeguard-49380
07/26/2024, 10:22 PMcool-lifeguard-49380
07/26/2024, 10:27 PMNot sure if we have to turn this into a policy (i.e. does it always make sense to pick the first error sorted by timestamp or should that be configurable?)I also thought about making this a config or policy. I think we should do that so that when another dist plugin in the future needs a slightly different appriach, we don't have to write a new output reader but just a new policy.
cool-lifeguard-49380
07/27/2024, 11:46 AMcool-lifeguard-49380
07/27/2024, 11:47 AMfierce-oil-47448
07/29/2024, 4:20 PMcool-lifeguard-49380
07/29/2024, 4:21 PMfierce-oil-47448
07/30/2024, 7:28 PMcool-lifeguard-49380
07/30/2024, 7:39 PM• Add entry point support for multi error file upload based on aboveThis requires a flyteidl change to add timestamp to error document.
fierce-oil-47448
08/20/2024, 12:25 AMhigh-accountant-32689
08/20/2024, 12:35 AMfierce-oil-47448
08/20/2024, 12:47 AMhigh-accountant-32689
08/20/2024, 1:38 AMfierce-oil-47448
08/20/2024, 5:20 AMfierce-oil-47448
08/20/2024, 5:22 AMhigh-accountant-32689
08/20/2024, 2:55 PMfierce-oil-47448
08/20/2024, 3:07 PMhigh-accountant-32689
08/20/2024, 3:23 PMfierce-oil-47448
08/20/2024, 6:30 PMfierce-oil-47448
08/20/2024, 9:07 PM