<#1736 Fix: Remove broken option to pass str value...
# flyte-github
a
#1736 Fix: Remove broken option to pass str value as `nproc_per_node` in `task_config=Elastic` Pull request opened by fg91 TL;DR
torchrun
has an argument called
--nproc-per-node
which can be either
"auto"
,
"cpu"
,
"gpu"
or an integer (see here). The argument controls the size of the local worker groups on each worker node. Accordingly, the
ElasticPolicy
of the
PyTorchJob
CRD in the kubeflow training operator (which is used by the
flytekitplugins.kfpytorch
plugin) states:
Copy code
type ElasticPolicy struct {
        ...
	// Number of workers per node; supported values: [auto, cpu, gpu, int].
        ...
When building the torch elastic/torchrun Flyte plugin, I translated this into the following `@task(task_config=...)`:
Copy code
@dataclass
class Elastic(object):
    """
    Configuration for `torch elastic training <https://pytorch.org/docs/stable/elastic/run.html>`_.
    Use this to run single- or multi-node distributed pytorch elastic training on k8s.
    ...
    Args:
        nproc_per_node (Union[int, str]): Number of workers per node. Supported values are [auto, cpu, gpu, int].
    """"
* * * However, the kubeflow training operator makes a mistake here which I unfortunately didn't notice and propagated into flytekit:
Copy code
type ElasticPolicy struct {
        ...
	// Number of workers per node; supported values: [auto, cpu, gpu, int].
	NProcPerNode *int32 `json:"nProcPerNode,omitempty"`
        ...
The type does not allow string values despite the comment directly above. If a user of the Flyte elastic plugin passes a string in their task decorator, registration currently fails since the respective message in
flyteidl
also expects an int32. * * * • I created an issue to fix this in the kubeflow training operator: kubeflow/training-operator#1861 • In this PR, I remove the option to pass a string value to
@task_config=Elastic(nproc_per_node=...)
because this didn't work in the first place unfortunately. In case this will be fixed in the training operator, I will fix this here. Type ☑︎ Bug Fix ☐ Feature ☐ Plugin Are all requirements met? ☑︎ Code completed ☑︎ Smoke tested ☐ Unit tests added ☑︎ Code documentation added ☑︎ Any pending items have an associated Issue Follow-up issue NA flyteorg/flytekit All checks have passed 30/30 successful checks