acoustic-carpenter-78188
07/12/2023, 2:29 PM
`torchrun` has an argument called `--nproc-per-node`, which can be either `"auto"`, `"cpu"`, `"gpu"`, or an integer (see here). The argument controls the size of the local worker groups on each worker node.
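To make the four accepted forms concrete, here is a minimal sketch of how such a value could be resolved to a process count. The helper name `resolve_nproc_per_node` and its `gpu_count` parameter are hypothetical and exist only for this illustration; this is not torchrun's own code, which queries CUDA directly. It only roughly mirrors the documented behaviour, where `"auto"` prefers GPUs when any are present and otherwise falls back to CPU cores.

```python
import os
from typing import Union


def resolve_nproc_per_node(value: Union[int, str], gpu_count: int = 0) -> int:
    """Hypothetical helper: map an --nproc-per-node style value to a worker count.

    gpu_count is passed in explicitly (an assumption of this sketch) so that
    the example runs without torch installed.
    """
    if isinstance(value, int):
        return value
    if value == "cpu":
        return os.cpu_count() or 1
    if value == "gpu":
        if gpu_count <= 0:
            raise ValueError("'gpu' requested but no GPUs are available")
        return gpu_count
    if value == "auto":
        # Prefer GPUs when present, otherwise fall back to CPU cores.
        return gpu_count if gpu_count > 0 else (os.cpu_count() or 1)
    # Integers may also arrive as strings, e.g. from a command line.
    try:
        return int(value)
    except ValueError:
        raise ValueError(f"Unsupported nproc_per_node value: {value!r}") from None
```

For example, `resolve_nproc_per_node("gpu", gpu_count=2)` yields 2 local workers, while `resolve_nproc_per_node(4)` passes the integer through unchanged.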
Accordingly, the `ElasticPolicy` of the `PyTorchJob` CRD in the kubeflow training operator (which is used by the `flytekitplugins.kfpytorch` plugin) states:
type ElasticPolicy struct {
...
// Number of workers per node; supported values: [auto, cpu, gpu, int].
...
When building the torch elastic/torchrun Flyte plugin, I translated this into the following `@task(task_config=...)`:
@dataclass
class Elastic(object):
    """
    Configuration for `torch elastic training <https://pytorch.org/docs/stable/elastic/run.html>`_.
    Use this to run single- or multi-node distributed pytorch elastic training on k8s.
    ...
    Args:
        nproc_per_node (Union[int, str]): Number of workers per node. Supported values are [auto, cpu, gpu, int].
    """
* * *
However, the kubeflow training operator makes a mistake here which I unfortunately didn't notice and propagated into flytekit:
type ElasticPolicy struct {
...
// Number of workers per node; supported values: [auto, cpu, gpu, int].
NProcPerNode *int32 `json:"nProcPerNode,omitempty"`
...
The type does not allow string values despite the comment directly above.
If a user of the Flyte elastic plugin passes a string in their task decorator, registration currently fails since the respective message in `flyteidl` also expects an int32.
* * *
• I created an issue to fix this in the kubeflow training operator: kubeflow/training-operator#1861
• In this PR, I remove the option to pass a string value to `@task(task_config=Elastic(nproc_per_node=...))` because, unfortunately, this never worked in the first place. If this gets fixed in the training operator, I will update it here accordingly.
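As an illustration of what restricting the option to integers means for users, here is a minimal sketch of a config class that rejects strings up front instead of failing later at registration. `ElasticSketch` and `INT32_MAX` are hypothetical names used only for this example; this is not the actual flytekit `Elastic` class, and the range check is an assumption based on the `int32` field type quoted above.

```python
from dataclasses import dataclass

# Upper bound of a signed 32-bit integer, matching the int32 field type.
INT32_MAX = 2**31 - 1


@dataclass
class ElasticSketch:
    """Hypothetical stand-in for the plugin config; not the flytekit class."""

    nproc_per_node: int = 1

    def __post_init__(self) -> None:
        value = self.nproc_per_node
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            raise TypeError(
                f"nproc_per_node must be an int, got {type(value).__name__}"
            )
        if not 1 <= value <= INT32_MAX:
            raise ValueError(f"nproc_per_node out of int32 range: {value}")
```

With this check, `ElasticSketch(nproc_per_node=4)` is accepted, while `ElasticSketch(nproc_per_node="gpu")` raises a `TypeError` at construction time.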
Type
☑︎ Bug Fix
☐ Feature
☐ Plugin
Are all requirements met?
☑︎ Code completed
☑︎ Smoke tested
☐ Unit tests added
☑︎ Code documentation added
☑︎ Any pending items have an associated Issue
Follow-up issue
NA
flyteorg/flytekit
✅ All checks have passed
30/30 successful checks
acoustic-carpenter-78188
07/12/2023, 3:34 PM