what is the recommended set up for training distributed pyto Flyte #flyte-support

what is the recommended set up for training distri...

shy-accountant-549

06/09/2023, 4:47 PM

what is the recommended set up for training distributed pytorch models? Found this guide but the kubeflow pytorch operator is now deprecated. is it better to use ray?

cool-lifeguard-49380

06/09/2023, 5:12 PM

You can deploy the kubeflow training operator https://github.com/kubeflow/training-operator

cool-lifeguard-49380

06/09/2023, 5:12 PM

It bundles pytorch, tensorflow, … into a single operator.

cool-lifeguard-49380

06/09/2023, 5:12 PM

It’s a plug and play replacement in Flyte.

cool-lifeguard-49380

06/09/2023, 5:21 PM

https://github.com/flyteorg/flytesnacks/pull/996 I adapted the docs

shy-accountant-549

06/09/2023, 5:43 PM

gratitude thank you

shy-accountant-549

06/09/2023, 6:03 PM

does the plugin config need to be updated as well?

cool-lifeguard-49380

06/10/2023, 9:19 AM

Good catch, thanks

cool-lifeguard-49380

06/10/2023, 9:19 AM

https://github.com/flyteorg/flyte/pull/3768 Created a PR for this as well

🙌 1

shy-accountant-549

06/12/2023, 2:58 PM

@cool-lifeguard-49380 thank you for the updates! I have been trying to run the pytorch_mnist example with the kf training operator but always got

RendezvousTimeoutError

on the

mnist_pytorch_job

task (see code snippets below). Also the task is not running as a

pytorchjob

but instead just a regular pod. Is it expected or am I missing something?

Copy code

@task(
    task_config=Elastic(
        nnodes=2,
        nproc_per_node=4,
    ),
    retries=2,
    cache=True,
    cache_version="1.0",
    requests=Resources(cpu=cpu_request, mem=mem_request, gpu=gpu_request),
    limits=Resources(mem=mem_limit, gpu=gpu_limit),
)
def mnist_pytorch_job(hp: Hyperparameters) -> TrainingOutputs:

cool-lifeguard-49380

06/12/2023, 3:33 PM

Is the kubeflow training operator installed in the cluster and the pytorch plugin configured in the helm values?

cool-lifeguard-49380

06/12/2023, 3:34 PM

As a first step: set nnodes to 1 (in this case it’s expected to run in a normal pod) and test whether it works then.

👀 1

shy-accountant-549

06/12/2023, 3:37 PM

yes the kf training operator was installed and I was able to run the smote_dist test below are the update helm values. do they look right to you?

Copy code

configuration:
  inline:
    plugins:
      k8s:
        inject-finalizer: true
        default-env-vars:
          - AWS_METADATA_SERVICE_TIMEOUT: 5
          - AWS_METADATA_SERVICE_NUM_ATTEMPTS: 20
        default-pod-template-name: flyte-task-template
      tasks:
        task-plugins:
          enabled-plugins:
            - container
            - sidecar
            - k8s-array
            - pytorch
          default-for-task-types:
            - container: container
            - sidecar: sidecar
            - container_array: k8s-array
            - pytorch: pytorch

shy-accountant-549

06/12/2023, 4:01 PM

As a first step: set nnodes to 1 (in this case it’s expected to run in a normal pod) and test whether it works then.

failed on downloading the mnist data 🥲

Copy code

Use cuda True
Using device: cuda, world size: 2
Using distributed PyTorch with gloo backend
Downloading <http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz>
Downloading <http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz> to ../data/MNIST/raw/train-images-idx3-ubyte.gz
100%|██████████| 9912422/9912422 [00:00<00:00, 268571741.14it/s]
Use cuda True
Using device: cuda, world size: 2
Using distributed PyTorch with gloo backend
Downloading <http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz>
Downloading <http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz> to ../data/MNIST/raw/train-images-idx3-ubyte.gz
Extracting ../data/MNIST/raw/train-images-idx3-ubyte.gz to ../data/MNIST/raw

Use cuda True
Using device: cuda, world size: 2
Using distributed PyTorch with gloo backend
Downloading <http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz>
Downloading <http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz> to ../data/MNIST/raw/train-images-idx3-ubyte.gz
Use cuda True
Using device: cuda, world size: 2
Using distributed PyTorch with gloo backend
Downloading <http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz>
Downloading <http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz> to ../data/MNIST/raw/train-images-idx3-ubyte.gz
100%|██████████| 9912422/9912422 [00:00<00:00, 97443910.64it/s]
100%|██████████| 9912422/9912422 [00:00<00:00, 104604826.32it/s]


100%|██████████| 9912422/9912422 [00:00<00:00, 32470167.20it/s]
Extracting ../data/MNIST/raw/train-images-idx3-ubyte.gz to ../data/MNIST/raw

Downloading <http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz>
Downloading <http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz> to ../data/MNIST/raw/train-labels-idx1-ubyte.gz
100%|██████████| 28881/28881 [00:00<00:00, 141019434.02it/s]
Extracting ../data/MNIST/raw/train-labels-idx1-ubyte.gz to ../data/MNIST/raw

Downloading <http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz>
Downloading <http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz> to ../data/MNIST/raw/t10k-images-idx3-ubyte.gz
100%|██████████| 1648877/1648877 [00:00<00:00, 9279339.05it/s]
Extracting ../data/MNIST/raw/t10k-images-idx3-ubyte.gz to ../data/MNIST/raw

Downloading <http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz>
Downloading <http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz> to ../data/MNIST/raw/t10k-labels-idx1-ubyte.gz
100%|██████████| 4542/4542 [00:00<00:00, 28561512.40it/s]
failed (exitcode: 1) local_rank: 1 (pid: 28) of fn: spawn_helper (start_method: spawn)
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 455, in _poll
    self._pc.join(-1)
  File "/opt/conda/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 371, in _wrap
    ret = record(fn)(*args_)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/flytekitplugins/kfpytorch/task.py", line 102, in spawn_helper
    return_val = fn(**kwargs)
  File "/app/kfpytorch/pytorch_mnist.py", line 206, in mnist_pytorch_job
    datasets.MNIST(
  File "/opt/conda/lib/python3.10/site-packages/torchvision/datasets/mnist.py", line 99, in __init__
    self.download()
  File "/opt/conda/lib/python3.10/site-packages/torchvision/datasets/mnist.py", line 187, in download
    download_and_extract_archive(url, download_root=self.raw_folder, filename=filename, md5=md5)
  File "/opt/conda/lib/python3.10/site-packages/torchvision/datasets/utils.py", line 447, in download_and_extract_archive
    download_url(url, download_root, filename, md5)
  File "/opt/conda/lib/python3.10/site-packages/torchvision/datasets/utils.py", line 168, in download_url
    raise RuntimeError("File not found or corrupted.")
RuntimeError: File not found or corrupted.

cool-lifeguard-49380

06/12/2023, 4:20 PM

That’s not related to the distributed setup though.

Copy code

@task(task_config=Elastic(nnodes=1, nproc_per_node=4))
def train():
    import torch.distributed as dist
    dist.init_process_group(backend="nccl"/"gloo")
    dist.barrier()

cool-lifeguard-49380

06/12/2023, 4:21 PM

To debug the distributed setup, you could use this minimal example (choose the desired backend).

cool-lifeguard-49380

06/12/2023, 4:21 PM

Plugin config looks right

shy-accountant-549

06/12/2023, 5:20 PM

the minimal example works for nnodes=1, logs:

Copy code

tar: Removing leading `/' from member names

Closing process 22 via signal SIGTERM
Closing process 25 via signal SIGTERM

however for nnodes=2 it failed with

RendezvousTimeoutError

, and it is not running as a pytorchjob

cool-lifeguard-49380

06/12/2023, 5:48 PM

Which version is your flyte propeller image?

cool-lifeguard-49380

06/12/2023, 5:48 PM

The elastic training is fairly new, you need 1.6.0 at least.

shy-accountant-549

06/12/2023, 5:51 PM

I deployed https://artifacthub.io/packages/helm/flyte/flyte 1.6.2

shy-accountant-549

06/12/2023, 5:54 PM

is the flyte propeller image the same as the helm chart version?

cool-lifeguard-49380

06/12/2023, 5:56 PM

Can you run

kubectl -n flyte get pod

to find out the flytepropeller pod name and then

kubectl -n flyte get pod <flytepropeller pod name> -o yaml | grep image

cool-lifeguard-49380

06/12/2023, 5:57 PM

is the flyte propeller image the same as the helm chart version?

I think so but not 100% sure 🤷

cool-lifeguard-49380

06/12/2023, 5:58 PM

Can you please also run

pip show flyteidl

shy-accountant-549

06/12/2023, 6:02 PM

there is only one pod

flyte-binary

, container image is cr.flyte.org/flyteorg/flyte-binary-release:v1.6.2

shy-accountant-549

06/12/2023, 6:04 PM

pip show flyteidl

on my local machine and the task container:

Copy code

Name: flyteidl
Version: 1.5.10
Summary: IDL for Flyte Platform
Home-page: <https://www.github.com/flyteorg/flyteidl>
Author: 
Author-email: 
License: 
Location: /opt/homebrew/Caskroom/miniconda/base/envs/glass/lib/python3.10/site-packages
Requires: googleapis-common-protos, protobuf, protoc-gen-swagger
Required-by: flytekit, flytekitplugins-kfpytorch

cool-lifeguard-49380

06/12/2023, 6:16 PM

Ok, single binary 👍

cool-lifeguard-49380

06/12/2023, 6:17 PM

Let’s do another sanity check, could you please replace the task_config=Elastic with PyTorch(num_workers=1)?

👀 1

cool-lifeguard-49380

06/12/2023, 6:17 PM

This one uses the pytorch operator as well but has been part of flyte for a long time.

cool-lifeguard-49380

06/12/2023, 6:17 PM

I want to figure out whether the problem is that there is “some component that is not new enough for elastic training to work”.

shy-accountant-549

06/12/2023, 6:22 PM

Copy code

Traceback (most recent call last):

      File "/opt/conda/lib/python3.10/site-packages/flytekit/exceptions/scopes.py", line 206, in user_entry_point
        return wrapped(*args, **kwargs)
      File "/app/kfpytorch/simple.py", line 10, in train
        dist.init_process_group(backend="gloo")
      File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 754, in init_process_group
        store, rank, world_size = next(rendezvous_iterator)
      File "/opt/conda/lib/python3.10/site-packages/torch/distributed/rendezvous.py", line 236, in _env_rendezvous_handler
        rank = int(_get_env_or_raise("RANK"))
      File "/opt/conda/lib/python3.10/site-packages/torch/distributed/rendezvous.py", line 221, in _get_env_or_raise
        raise _env_error(env_var)

Message:

    Error initializing torch.distributed using env:// rendezvous: environment variable RANK expected, but not set

cool-lifeguard-49380

06/12/2023, 6:23 PM

So also with

task_config=PyTorch

it starts a normal pod and not distributed training?

shy-accountant-549

06/12/2023, 6:24 PM

correct

cool-lifeguard-49380

06/12/2023, 6:26 PM

Ok then the problem is not that something is not new enough, good to know.

cool-lifeguard-49380

06/12/2023, 6:26 PM

The training operator pod is running in the kubeflow namespace?

cool-lifeguard-49380

06/12/2023, 6:27 PM

`kubectl`get pytorchjobs.kubeflow.org also doesn’t complain that the resource type doesn’t exist?

shy-accountant-549

06/12/2023, 6:28 PM

no. I was able to run the smoke test:

Copy code

(base)  % kubectl get <http://pytorchjobs.kubeflow.org|pytorchjobs.kubeflow.org>                                                                                                                                                                                                                      tmp/tmp (main ⚡) protopias-mbp
NAME                          STATE       AGE
pytorch-dist-basic-sendrecv   Succeeded   3h48m

cool-lifeguard-49380

06/12/2023, 6:30 PM

Mh 🤔

cool-lifeguard-49380

06/12/2023, 6:30 PM

Copy code

enabled_plugins:
    tasks:
      task-plugins:
        enabled-plugins:
          - container
          - sidecar
          - k8s-array
          - pytorch
        default-for-task-types:
          container: container
          sidecar: sidecar
          container_array: k8s-array
          pytorch: pytorch

cool-lifeguard-49380

06/12/2023, 6:31 PM

This is the part of our config which activates the pytorch plugin.

cool-lifeguard-49380

06/12/2023, 6:31 PM

Same as yours though, isn’t it?

shy-accountant-549

06/12/2023, 6:32 PM

how did you apply the config, through helm values?

cool-lifeguard-49380

06/12/2023, 6:32 PM

yes

cool-lifeguard-49380

06/12/2023, 6:32 PM

We don’t run the single binary though, I cannot say whether the config needs to be different in this case unfortunately.

shy-accountant-549

06/12/2023, 6:34 PM

hmm, that could be the reason because for single binary I need to put it in the inline config section

Copy code

configuration:
  inline:
    plugins:
      k8s:
        inject-finalizer: true
        default-env-vars:
          - AWS_METADATA_SERVICE_TIMEOUT: 5
          - AWS_METADATA_SERVICE_NUM_ATTEMPTS: 20
        default-pod-template-name: flyte-task-template
      tasks:
        task-plugins:
          enabled-plugins:
            - container
            - sidecar
            - k8s-array
            - pytorch
          default-for-task-types:
            - container: container
            - sidecar: sidecar
            - container_array: k8s-array
            - pytorch: pytorch

shy-accountant-549

06/12/2023, 6:37 PM

the final cm looks

Copy code

apiVersion: v1
data:
  000-core.yaml: "admin:\n  endpoint: localhost:8089\n  insecure: true\ncatalog-cache:\n
    \ endpoint: localhost:8081\n  insecure: true\n  type: datacatalog\ncluster_resources:\n
    \ standaloneDeployment: false\n  templatePath: /etc/flyte/cluster-resource-templates\nlogger:\n
    \ show-source: true\n  level: 1\npropeller:\n  create-flyteworkflow-crd: true\nwebhook:\n
    \ certDir: /var/run/flyte/certs\n  localCert: true\n  secretName: sgs-flyte-binary-webhook-secret\n
    \ serviceName: sgs-flyte-binary-webhook\n  servicePort: 443\nflyte: \n  admin:\n
    \   disableClusterResourceManager: false\n    disableScheduler: false\n    disabled:
    false\n    seedProjects:\n    - flytesnacks\n  dataCatalog:\n    disabled: false\n
    \ propeller:\n    disableWebhook: false\n    disabled: false\n"
  001-plugins.yaml: |
    tasks:
      task-plugins:
        enabled-plugins:
          - container
          - sidecar
          - K8S-ARRAY
        default-for-task-types:
          - container: container
          - container_array: K8S-ARRAY
    plugins:
      logs:
        kubernetes-enabled: false
        cloudwatch-enabled: false
        stackdriver-enabled: false
      k8s:
        co-pilot:
          image: "<http://cr.flyte.org/flyteorg/flytecopilot-release:v1.6.2|cr.flyte.org/flyteorg/flytecopilot-release:v1.6.2>"
      k8s-array:
        logs:
          config:
            kubernetes-enabled: false
            cloudwatch-enabled: false
            stackdriver-enabled: false
  010-inline-config.yaml: |
    plugins:
      k8s:
        default-env-vars:
        - AWS_METADATA_SERVICE_TIMEOUT: 5
        - AWS_METADATA_SERVICE_NUM_ATTEMPTS: 20
        default-pod-template-name: flyte-task-template
        inject-finalizer: true
      tasks:
        task-plugins:
          default-for-task-types:
          - container: container
          - sidecar: sidecar
          - container_array: k8s-array
          - pytorch: pytorch
          enabled-plugins:
          - container
          - sidecar
          - k8s-array
          - pytorch
    storage:
      cache:
        max_size_mbs: 100
        target_gc_percent: 100
    task_resources:
      defaults:
        cpu: 200m
        memory: 1Gi
        storage: 128Gi
      limits:
        cpu: 64
        gpu: 8
        memory: 488Gi
        storage: 8Ti
kind: ConfigMap

cool-lifeguard-49380

06/12/2023, 6:44 PM

Here

001-plugins.yaml: |

it’s not configured?

shy-accountant-549

06/12/2023, 6:49 PM

yeah like I said i can only provide the config values through

010-inline-config.yaml

to the flyte binary chart

shy-accountant-549

06/12/2023, 6:49 PM

I am trying patching the plugins.yaml directly with kustomize now

cool-lifeguard-49380

06/12/2023, 6:54 PM

You could, as a hack, delete the training operator and the pytorchjob crd and then install flyte. If the pytorch plugin is correctly configured, flytepropeller will fail to start if the pytorchjob crd is not there.

cool-lifeguard-49380

06/12/2023, 6:55 PM

This might give you an easy way to identify when the pytorch plugin part in the config is being honored.

shy-accountant-549

06/12/2023, 6:58 PM

will flyte propeller fail on executing or registering the workflow?

shy-accountant-549

06/12/2023, 8:15 PM

I am trying patching the plugins.yaml directly with kustomize now

this works! below is the patch

Copy code

patches:
  - target:
      kind: ConfigMap
      name: flyte-binary-config
    patch: |-
      - op: replace
        path: /data/001-plugins.yaml
        value: |
          tasks:
            task-plugins:
              enabled-plugins:
                - container
                - sidecar
                - k8s-array
                - pytorch
              default-for-task-types:
                container: container
                sidecar: sidecar
                container_array: k8s-array
                pytorch: pytorch
          plugins:
            logs:
              kubernetes-enabled: false
              cloudwatch-enabled: false
              stackdriver-enabled: false
            k8s:
              co-pilot:
                image: "<http://cr.flyte.org/flyteorg/flytecopilot-release:v1.6.2|cr.flyte.org/flyteorg/flytecopilot-release:v1.6.2>"
            k8s-array:
              logs:
                config:
                  kubernetes-enabled: false
                  cloudwatch-enabled: false
                  stackdriver-enabled: false

shy-accountant-549

06/12/2023, 8:17 PM

so it seems for flyte binary the inline config values do not work for plugins. it has to be patched directly on

001-plugins.yaml

cool-lifeguard-49380

06/12/2023, 10:59 PM

cool that it works 🙂

shy-accountant-549

06/15/2023, 12:19 AM

Thank you for all the help!

cool-lifeguard-49380

06/15/2023, 8:21 AM

🙂

214 Views

Open in Slack

Previous Next