what is the recommended set up for training distri...
# ask-the-community
n
what is the recommended set up for training distributed pytorch models? Found this guide but the kubeflow pytorch operator is now deprecated. is it better to use ray?
f
You can deploy the kubeflow training operator https://github.com/kubeflow/training-operator
It bundles pytorch, tensorflow, … into a single operator.
It’s a plug and play replacement in Flyte.
n
gratitude thank you
does the plugin config need to be updated as well?
f
Good catch, thanks
https://github.com/flyteorg/flyte/pull/3768 Created a PR for this as well
n
@Fabio Grätz thank you for the updates! I have been trying to run the pytorch_mnist example with the kf training operator but always got
RendezvousTimeoutError
on the
mnist_pytorch_job
task (see code snippets below). Also the task is not running as a
pytorchjob
but instead just a regular pod. Is it expected or am I missing something?
Copy code
@task(
    task_config=Elastic(
        nnodes=2,
        nproc_per_node=4,
    ),
    retries=2,
    cache=True,
    cache_version="1.0",
    requests=Resources(cpu=cpu_request, mem=mem_request, gpu=gpu_request),
    limits=Resources(mem=mem_limit, gpu=gpu_limit),
)
def mnist_pytorch_job(hp: Hyperparameters) -> TrainingOutputs:
f
Is the kubeflow training operator installed in the cluster and the pytorch plugin configured in the helm values?
As a first step: set nnodes to 1 (in this case it’s expected to run in a normal pod) and test whether it works then.
n
yes the kf training operator was installed and I was able to run the smote_dist test below are the update helm values. do they look right to you?
Copy code
configuration:
  inline:
    plugins:
      k8s:
        inject-finalizer: true
        default-env-vars:
          - AWS_METADATA_SERVICE_TIMEOUT: 5
          - AWS_METADATA_SERVICE_NUM_ATTEMPTS: 20
        default-pod-template-name: flyte-task-template
      tasks:
        task-plugins:
          enabled-plugins:
            - container
            - sidecar
            - k8s-array
            - pytorch
          default-for-task-types:
            - container: container
            - sidecar: sidecar
            - container_array: k8s-array
            - pytorch: pytorch
As a first step: set nnodes to 1 (in this case it’s expected to run in a normal pod) and test whether it works then.
failed on downloading the mnist data 🥲
Copy code
Use cuda True
Using device: cuda, world size: 2
Using distributed PyTorch with gloo backend
Downloading <http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz>
Downloading <http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz> to ../data/MNIST/raw/train-images-idx3-ubyte.gz
100%|██████████| 9912422/9912422 [00:00<00:00, 268571741.14it/s]
Use cuda True
Using device: cuda, world size: 2
Using distributed PyTorch with gloo backend
Downloading <http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz>
Downloading <http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz> to ../data/MNIST/raw/train-images-idx3-ubyte.gz
Extracting ../data/MNIST/raw/train-images-idx3-ubyte.gz to ../data/MNIST/raw

Use cuda True
Using device: cuda, world size: 2
Using distributed PyTorch with gloo backend
Downloading <http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz>
Downloading <http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz> to ../data/MNIST/raw/train-images-idx3-ubyte.gz
Use cuda True
Using device: cuda, world size: 2
Using distributed PyTorch with gloo backend
Downloading <http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz>
Downloading <http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz> to ../data/MNIST/raw/train-images-idx3-ubyte.gz
100%|██████████| 9912422/9912422 [00:00<00:00, 97443910.64it/s]
100%|██████████| 9912422/9912422 [00:00<00:00, 104604826.32it/s]


100%|██████████| 9912422/9912422 [00:00<00:00, 32470167.20it/s]
Extracting ../data/MNIST/raw/train-images-idx3-ubyte.gz to ../data/MNIST/raw

Downloading <http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz>
Downloading <http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz> to ../data/MNIST/raw/train-labels-idx1-ubyte.gz
100%|██████████| 28881/28881 [00:00<00:00, 141019434.02it/s]
Extracting ../data/MNIST/raw/train-labels-idx1-ubyte.gz to ../data/MNIST/raw

Downloading <http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz>
Downloading <http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz> to ../data/MNIST/raw/t10k-images-idx3-ubyte.gz
100%|██████████| 1648877/1648877 [00:00<00:00, 9279339.05it/s]
Extracting ../data/MNIST/raw/t10k-images-idx3-ubyte.gz to ../data/MNIST/raw

Downloading <http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz>
Downloading <http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz> to ../data/MNIST/raw/t10k-labels-idx1-ubyte.gz
100%|██████████| 4542/4542 [00:00<00:00, 28561512.40it/s]
failed (exitcode: 1) local_rank: 1 (pid: 28) of fn: spawn_helper (start_method: spawn)
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 455, in _poll
    self._pc.join(-1)
  File "/opt/conda/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 371, in _wrap
    ret = record(fn)(*args_)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/flytekitplugins/kfpytorch/task.py", line 102, in spawn_helper
    return_val = fn(**kwargs)
  File "/app/kfpytorch/pytorch_mnist.py", line 206, in mnist_pytorch_job
    datasets.MNIST(
  File "/opt/conda/lib/python3.10/site-packages/torchvision/datasets/mnist.py", line 99, in __init__
    self.download()
  File "/opt/conda/lib/python3.10/site-packages/torchvision/datasets/mnist.py", line 187, in download
    download_and_extract_archive(url, download_root=self.raw_folder, filename=filename, md5=md5)
  File "/opt/conda/lib/python3.10/site-packages/torchvision/datasets/utils.py", line 447, in download_and_extract_archive
    download_url(url, download_root, filename, md5)
  File "/opt/conda/lib/python3.10/site-packages/torchvision/datasets/utils.py", line 168, in download_url
    raise RuntimeError("File not found or corrupted.")
RuntimeError: File not found or corrupted.
f
That’s not related to the distributed setup though.
Copy code
@task(task_config=Elastic(nnodes=1, nproc_per_node=4))
def train():
    import torch.distributed as dist
    dist.init_process_group(backend="nccl"/"gloo")
    dist.barrier()
To debug the distributed setup, you could use this minimal example (choose the desired backend).
Plugin config looks right
n
the minimal example works for nnodes=1, logs:
Copy code
tar: Removing leading `/' from member names

Closing process 22 via signal SIGTERM
Closing process 25 via signal SIGTERM
however for nnodes=2 it failed with
RendezvousTimeoutError
, and it is not running as a pytorchjob
f
Which version is your flyte propeller image?
The elastic training is fairly new, you need 1.6.0 at least.
n
is the flyte propeller image the same as the helm chart version?
f
Can you run
kubectl -n flyte get pod
to find out the flytepropeller pod name and then
kubectl -n flyte get pod <flytepropeller pod name> -o yaml | grep image
?
is the flyte propeller image the same as the helm chart version?
I think so but not 100% sure 🤷
Can you please also run
pip show flyteidl
?
n
there is only one pod
flyte-binary
, container image is cr.flyte.org/flyteorg/flyte-binary-release:v1.6.2
pip show flyteidl
on my local machine and the task container:
Copy code
Name: flyteidl
Version: 1.5.10
Summary: IDL for Flyte Platform
Home-page: <https://www.github.com/flyteorg/flyteidl>
Author: 
Author-email: 
License: 
Location: /opt/homebrew/Caskroom/miniconda/base/envs/glass/lib/python3.10/site-packages
Requires: googleapis-common-protos, protobuf, protoc-gen-swagger
Required-by: flytekit, flytekitplugins-kfpytorch
f
Ok, single binary 👍
Let’s do another sanity check, could you please replace the task_config=Elastic with PyTorch(num_workers=1)?
This one uses the pytorch operator as well but has been part of flyte for a long time.
I want to figure out whether the problem is that there is “some component that is not new enough for elastic training to work”.
n
Copy code
Traceback (most recent call last):

      File "/opt/conda/lib/python3.10/site-packages/flytekit/exceptions/scopes.py", line 206, in user_entry_point
        return wrapped(*args, **kwargs)
      File "/app/kfpytorch/simple.py", line 10, in train
        dist.init_process_group(backend="gloo")
      File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 754, in init_process_group
        store, rank, world_size = next(rendezvous_iterator)
      File "/opt/conda/lib/python3.10/site-packages/torch/distributed/rendezvous.py", line 236, in _env_rendezvous_handler
        rank = int(_get_env_or_raise("RANK"))
      File "/opt/conda/lib/python3.10/site-packages/torch/distributed/rendezvous.py", line 221, in _get_env_or_raise
        raise _env_error(env_var)

Message:

    Error initializing torch.distributed using env:// rendezvous: environment variable RANK expected, but not set
f
So also with
task_config=PyTorch
it starts a normal pod and not distributed training?
n
correct
f
Ok then the problem is not that something is not new enough, good to know.
The training operator pod is running in the kubeflow namespace?
`kubectl`get pytorchjobs.kubeflow.org also doesn’t complain that the resource type doesn’t exist?
n
no. I was able to run the smoke test:
Copy code
(base)  % kubectl get <http://pytorchjobs.kubeflow.org|pytorchjobs.kubeflow.org>                                                                                                                                                                                                                      tmp/tmp (main ⚡) protopias-mbp
NAME                          STATE       AGE
pytorch-dist-basic-sendrecv   Succeeded   3h48m
f
Mh 🤔
Copy code
enabled_plugins:
    tasks:
      task-plugins:
        enabled-plugins:
          - container
          - sidecar
          - k8s-array
          - pytorch
        default-for-task-types:
          container: container
          sidecar: sidecar
          container_array: k8s-array
          pytorch: pytorch
This is the part of our config which activates the pytorch plugin.
Same as yours though, isn’t it?
n
how did you apply the config, through helm values?
f
yes
We don’t run the single binary though, I cannot say whether the config needs to be different in this case unfortunately.
n
hmm, that could be the reason because for single binary I need to put it in the inline config section
Copy code
configuration:
  inline:
    plugins:
      k8s:
        inject-finalizer: true
        default-env-vars:
          - AWS_METADATA_SERVICE_TIMEOUT: 5
          - AWS_METADATA_SERVICE_NUM_ATTEMPTS: 20
        default-pod-template-name: flyte-task-template
      tasks:
        task-plugins:
          enabled-plugins:
            - container
            - sidecar
            - k8s-array
            - pytorch
          default-for-task-types:
            - container: container
            - sidecar: sidecar
            - container_array: k8s-array
            - pytorch: pytorch
the final cm looks
Copy code
apiVersion: v1
data:
  000-core.yaml: "admin:\n  endpoint: localhost:8089\n  insecure: true\ncatalog-cache:\n
    \ endpoint: localhost:8081\n  insecure: true\n  type: datacatalog\ncluster_resources:\n
    \ standaloneDeployment: false\n  templatePath: /etc/flyte/cluster-resource-templates\nlogger:\n
    \ show-source: true\n  level: 1\npropeller:\n  create-flyteworkflow-crd: true\nwebhook:\n
    \ certDir: /var/run/flyte/certs\n  localCert: true\n  secretName: sgs-flyte-binary-webhook-secret\n
    \ serviceName: sgs-flyte-binary-webhook\n  servicePort: 443\nflyte: \n  admin:\n
    \   disableClusterResourceManager: false\n    disableScheduler: false\n    disabled:
    false\n    seedProjects:\n    - flytesnacks\n  dataCatalog:\n    disabled: false\n
    \ propeller:\n    disableWebhook: false\n    disabled: false\n"
  001-plugins.yaml: |
    tasks:
      task-plugins:
        enabled-plugins:
          - container
          - sidecar
          - K8S-ARRAY
        default-for-task-types:
          - container: container
          - container_array: K8S-ARRAY
    plugins:
      logs:
        kubernetes-enabled: false
        cloudwatch-enabled: false
        stackdriver-enabled: false
      k8s:
        co-pilot:
          image: "<http://cr.flyte.org/flyteorg/flytecopilot-release:v1.6.2|cr.flyte.org/flyteorg/flytecopilot-release:v1.6.2>"
      k8s-array:
        logs:
          config:
            kubernetes-enabled: false
            cloudwatch-enabled: false
            stackdriver-enabled: false
  010-inline-config.yaml: |
    plugins:
      k8s:
        default-env-vars:
        - AWS_METADATA_SERVICE_TIMEOUT: 5
        - AWS_METADATA_SERVICE_NUM_ATTEMPTS: 20
        default-pod-template-name: flyte-task-template
        inject-finalizer: true
      tasks:
        task-plugins:
          default-for-task-types:
          - container: container
          - sidecar: sidecar
          - container_array: k8s-array
          - pytorch: pytorch
          enabled-plugins:
          - container
          - sidecar
          - k8s-array
          - pytorch
    storage:
      cache:
        max_size_mbs: 100
        target_gc_percent: 100
    task_resources:
      defaults:
        cpu: 200m
        memory: 1Gi
        storage: 128Gi
      limits:
        cpu: 64
        gpu: 8
        memory: 488Gi
        storage: 8Ti
kind: ConfigMap
f
Here
001-plugins.yaml: |
it’s not configured?
n
yeah like I said i can only provide the config values through
010-inline-config.yaml
to the flyte binary chart
I am trying patching the plugins.yaml directly with kustomize now
f
You could, as a hack, delete the training operator and the pytorchjob crd and then install flyte. If the pytorch plugin is correctly configured, flytepropeller will fail to start if the pytorchjob crd is not there.
This might give you an easy way to identify when the pytorch plugin part in the config is being honored.
n
will flyte propeller fail on executing or registering the workflow?
I am trying patching the plugins.yaml directly with kustomize now
this works! below is the patch
Copy code
patches:
  - target:
      kind: ConfigMap
      name: flyte-binary-config
    patch: |-
      - op: replace
        path: /data/001-plugins.yaml
        value: |
          tasks:
            task-plugins:
              enabled-plugins:
                - container
                - sidecar
                - k8s-array
                - pytorch
              default-for-task-types:
                container: container
                sidecar: sidecar
                container_array: k8s-array
                pytorch: pytorch
          plugins:
            logs:
              kubernetes-enabled: false
              cloudwatch-enabled: false
              stackdriver-enabled: false
            k8s:
              co-pilot:
                image: "<http://cr.flyte.org/flyteorg/flytecopilot-release:v1.6.2|cr.flyte.org/flyteorg/flytecopilot-release:v1.6.2>"
            k8s-array:
              logs:
                config:
                  kubernetes-enabled: false
                  cloudwatch-enabled: false
                  stackdriver-enabled: false
so it seems for flyte binary the inline config values do not work for plugins. it has to be patched directly on
001-plugins.yaml
f
cool that it works 🙂
n
Thank you for all the help!
f
🙂
130 Views