Nan Qin
06/09/2023, 4:47 PMFabio Grätz
06/09/2023, 5:12 PMNan Qin
06/09/2023, 5:43 PMFabio Grätz
06/10/2023, 9:19 AMNan Qin
06/12/2023, 2:58 PMRendezvousTimeoutError
on the mnist_pytorch_job
task (see code snippets below). Also the task is not running as a pytorchjob
but instead just a regular pod. Is it expected or am I missing something?
@task(
task_config=Elastic(
nnodes=2,
nproc_per_node=4,
),
retries=2,
cache=True,
cache_version="1.0",
requests=Resources(cpu=cpu_request, mem=mem_request, gpu=gpu_request),
limits=Resources(mem=mem_limit, gpu=gpu_limit),
)
def mnist_pytorch_job(hp: Hyperparameters) -> TrainingOutputs:
Fabio Grätz
06/12/2023, 3:33 PMNan Qin
06/12/2023, 3:37 PMconfiguration:
inline:
plugins:
k8s:
inject-finalizer: true
default-env-vars:
- AWS_METADATA_SERVICE_TIMEOUT: 5
- AWS_METADATA_SERVICE_NUM_ATTEMPTS: 20
default-pod-template-name: flyte-task-template
tasks:
task-plugins:
enabled-plugins:
- container
- sidecar
- k8s-array
- pytorch
default-for-task-types:
- container: container
- sidecar: sidecar
- container_array: k8s-array
- pytorch: pytorch
As a first step: set nnodes to 1 (in this case it’s expected to run in a normal pod) and test whether it works then.failed on downloading the mnist data 🥲
Use cuda True
Using device: cuda, world size: 2
Using distributed PyTorch with gloo backend
Downloading <http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz>
Downloading <http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz> to ../data/MNIST/raw/train-images-idx3-ubyte.gz
100%|██████████| 9912422/9912422 [00:00<00:00, 268571741.14it/s]
Use cuda True
Using device: cuda, world size: 2
Using distributed PyTorch with gloo backend
Downloading <http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz>
Downloading <http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz> to ../data/MNIST/raw/train-images-idx3-ubyte.gz
Extracting ../data/MNIST/raw/train-images-idx3-ubyte.gz to ../data/MNIST/raw
Use cuda True
Using device: cuda, world size: 2
Using distributed PyTorch with gloo backend
Downloading <http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz>
Downloading <http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz> to ../data/MNIST/raw/train-images-idx3-ubyte.gz
Use cuda True
Using device: cuda, world size: 2
Using distributed PyTorch with gloo backend
Downloading <http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz>
Downloading <http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz> to ../data/MNIST/raw/train-images-idx3-ubyte.gz
100%|██████████| 9912422/9912422 [00:00<00:00, 97443910.64it/s]
100%|██████████| 9912422/9912422 [00:00<00:00, 104604826.32it/s]
100%|██████████| 9912422/9912422 [00:00<00:00, 32470167.20it/s]
Extracting ../data/MNIST/raw/train-images-idx3-ubyte.gz to ../data/MNIST/raw
Downloading <http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz>
Downloading <http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz> to ../data/MNIST/raw/train-labels-idx1-ubyte.gz
100%|██████████| 28881/28881 [00:00<00:00, 141019434.02it/s]
Extracting ../data/MNIST/raw/train-labels-idx1-ubyte.gz to ../data/MNIST/raw
Downloading <http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz>
Downloading <http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz> to ../data/MNIST/raw/t10k-images-idx3-ubyte.gz
100%|██████████| 1648877/1648877 [00:00<00:00, 9279339.05it/s]
Extracting ../data/MNIST/raw/t10k-images-idx3-ubyte.gz to ../data/MNIST/raw
Downloading <http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz>
Downloading <http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz> to ../data/MNIST/raw/t10k-labels-idx1-ubyte.gz
100%|██████████| 4542/4542 [00:00<00:00, 28561512.40it/s]
failed (exitcode: 1) local_rank: 1 (pid: 28) of fn: spawn_helper (start_method: spawn)
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 455, in _poll
self._pc.join(-1)
File "/opt/conda/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 160, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 1 terminated with the following error:
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
fn(i, *args)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 371, in _wrap
ret = record(fn)(*args_)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/flytekitplugins/kfpytorch/task.py", line 102, in spawn_helper
return_val = fn(**kwargs)
File "/app/kfpytorch/pytorch_mnist.py", line 206, in mnist_pytorch_job
datasets.MNIST(
File "/opt/conda/lib/python3.10/site-packages/torchvision/datasets/mnist.py", line 99, in __init__
self.download()
File "/opt/conda/lib/python3.10/site-packages/torchvision/datasets/mnist.py", line 187, in download
download_and_extract_archive(url, download_root=self.raw_folder, filename=filename, md5=md5)
File "/opt/conda/lib/python3.10/site-packages/torchvision/datasets/utils.py", line 447, in download_and_extract_archive
download_url(url, download_root, filename, md5)
File "/opt/conda/lib/python3.10/site-packages/torchvision/datasets/utils.py", line 168, in download_url
raise RuntimeError("File not found or corrupted.")
RuntimeError: File not found or corrupted.
Fabio Grätz
06/12/2023, 4:20 PM@task(task_config=Elastic(nnodes=1, nproc_per_node=4))
def train():
import torch.distributed as dist
dist.init_process_group(backend="nccl"/"gloo")
dist.barrier()
Nan Qin
06/12/2023, 5:20 PMtar: Removing leading `/' from member names
Closing process 22 via signal SIGTERM
Closing process 25 via signal SIGTERM
however for nnodes=2 it failed with RendezvousTimeoutError
, and it is not running as a pytorchjobFabio Grätz
06/12/2023, 5:48 PMNan Qin
06/12/2023, 5:51 PMFabio Grätz
06/12/2023, 5:56 PMkubectl -n flyte get pod
to find out the flytepropeller pod name and then kubectl -n flyte get pod <flytepropeller pod name> -o yaml | grep image
?is the flyte propeller image the same as the helm chart version?I think so but not 100% sure 🤷
pip show flyteidl
?Nan Qin
06/12/2023, 6:02 PMflyte-binary
, container image is cr.flyte.org/flyteorg/flyte-binary-release:v1.6.2pip show flyteidl
on my local machine and the task container:
Name: flyteidl
Version: 1.5.10
Summary: IDL for Flyte Platform
Home-page: <https://www.github.com/flyteorg/flyteidl>
Author:
Author-email:
License:
Location: /opt/homebrew/Caskroom/miniconda/base/envs/glass/lib/python3.10/site-packages
Requires: googleapis-common-protos, protobuf, protoc-gen-swagger
Required-by: flytekit, flytekitplugins-kfpytorch
Fabio Grätz
06/12/2023, 6:16 PMNan Qin
06/12/2023, 6:22 PMTraceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/flytekit/exceptions/scopes.py", line 206, in user_entry_point
return wrapped(*args, **kwargs)
File "/app/kfpytorch/simple.py", line 10, in train
dist.init_process_group(backend="gloo")
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 754, in init_process_group
store, rank, world_size = next(rendezvous_iterator)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/rendezvous.py", line 236, in _env_rendezvous_handler
rank = int(_get_env_or_raise("RANK"))
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/rendezvous.py", line 221, in _get_env_or_raise
raise _env_error(env_var)
Message:
Error initializing torch.distributed using env:// rendezvous: environment variable RANK expected, but not set
Fabio Grätz
06/12/2023, 6:23 PMtask_config=PyTorch
it starts a normal pod and not distributed training?Nan Qin
06/12/2023, 6:24 PMFabio Grätz
06/12/2023, 6:26 PMNan Qin
06/12/2023, 6:28 PM(base) % kubectl get <http://pytorchjobs.kubeflow.org|pytorchjobs.kubeflow.org> tmp/tmp (main ⚡) protopias-mbp
NAME STATE AGE
pytorch-dist-basic-sendrecv Succeeded 3h48m
Fabio Grätz
06/12/2023, 6:30 PMenabled_plugins:
tasks:
task-plugins:
enabled-plugins:
- container
- sidecar
- k8s-array
- pytorch
default-for-task-types:
container: container
sidecar: sidecar
container_array: k8s-array
pytorch: pytorch
Nan Qin
06/12/2023, 6:32 PMFabio Grätz
06/12/2023, 6:32 PMNan Qin
06/12/2023, 6:34 PMconfiguration:
inline:
plugins:
k8s:
inject-finalizer: true
default-env-vars:
- AWS_METADATA_SERVICE_TIMEOUT: 5
- AWS_METADATA_SERVICE_NUM_ATTEMPTS: 20
default-pod-template-name: flyte-task-template
tasks:
task-plugins:
enabled-plugins:
- container
- sidecar
- k8s-array
- pytorch
default-for-task-types:
- container: container
- sidecar: sidecar
- container_array: k8s-array
- pytorch: pytorch
apiVersion: v1
data:
000-core.yaml: "admin:\n endpoint: localhost:8089\n insecure: true\ncatalog-cache:\n
\ endpoint: localhost:8081\n insecure: true\n type: datacatalog\ncluster_resources:\n
\ standaloneDeployment: false\n templatePath: /etc/flyte/cluster-resource-templates\nlogger:\n
\ show-source: true\n level: 1\npropeller:\n create-flyteworkflow-crd: true\nwebhook:\n
\ certDir: /var/run/flyte/certs\n localCert: true\n secretName: sgs-flyte-binary-webhook-secret\n
\ serviceName: sgs-flyte-binary-webhook\n servicePort: 443\nflyte: \n admin:\n
\ disableClusterResourceManager: false\n disableScheduler: false\n disabled:
false\n seedProjects:\n - flytesnacks\n dataCatalog:\n disabled: false\n
\ propeller:\n disableWebhook: false\n disabled: false\n"
001-plugins.yaml: |
tasks:
task-plugins:
enabled-plugins:
- container
- sidecar
- K8S-ARRAY
default-for-task-types:
- container: container
- container_array: K8S-ARRAY
plugins:
logs:
kubernetes-enabled: false
cloudwatch-enabled: false
stackdriver-enabled: false
k8s:
co-pilot:
image: "<http://cr.flyte.org/flyteorg/flytecopilot-release:v1.6.2|cr.flyte.org/flyteorg/flytecopilot-release:v1.6.2>"
k8s-array:
logs:
config:
kubernetes-enabled: false
cloudwatch-enabled: false
stackdriver-enabled: false
010-inline-config.yaml: |
plugins:
k8s:
default-env-vars:
- AWS_METADATA_SERVICE_TIMEOUT: 5
- AWS_METADATA_SERVICE_NUM_ATTEMPTS: 20
default-pod-template-name: flyte-task-template
inject-finalizer: true
tasks:
task-plugins:
default-for-task-types:
- container: container
- sidecar: sidecar
- container_array: k8s-array
- pytorch: pytorch
enabled-plugins:
- container
- sidecar
- k8s-array
- pytorch
storage:
cache:
max_size_mbs: 100
target_gc_percent: 100
task_resources:
defaults:
cpu: 200m
memory: 1Gi
storage: 128Gi
limits:
cpu: 64
gpu: 8
memory: 488Gi
storage: 8Ti
kind: ConfigMap
Fabio Grätz
06/12/2023, 6:44 PM001-plugins.yaml: |
it’s not configured?Nan Qin
06/12/2023, 6:49 PM010-inline-config.yaml
to the flyte binary chartFabio Grätz
06/12/2023, 6:54 PMNan Qin
06/12/2023, 6:58 PMI am trying patching the plugins.yaml directly with kustomize nowthis works! below is the patch
patches:
- target:
kind: ConfigMap
name: flyte-binary-config
patch: |-
- op: replace
path: /data/001-plugins.yaml
value: |
tasks:
task-plugins:
enabled-plugins:
- container
- sidecar
- k8s-array
- pytorch
default-for-task-types:
container: container
sidecar: sidecar
container_array: k8s-array
pytorch: pytorch
plugins:
logs:
kubernetes-enabled: false
cloudwatch-enabled: false
stackdriver-enabled: false
k8s:
co-pilot:
image: "<http://cr.flyte.org/flyteorg/flytecopilot-release:v1.6.2|cr.flyte.org/flyteorg/flytecopilot-release:v1.6.2>"
k8s-array:
logs:
config:
kubernetes-enabled: false
cloudwatch-enabled: false
stackdriver-enabled: false
001-plugins.yaml
Fabio Grätz
06/12/2023, 10:59 PMNan Qin
06/15/2023, 12:19 AMFabio Grätz
06/15/2023, 8:21 AM