ambitious-france-31318
08/04/2023, 7:38 PM
I'm getting an UNKNOWN status on every workflow that I submit to Flyte, and it just stays in that state (it never moves to a RUNNING state).
Some background on the Flyte installation:
I deployed Flyte on a local K8s cluster before deploying it to our real K8s environment (sort of a POC). I recently installed the MPI Operator in order to be able to parallelize an ML workflow. I couldn't upgrade the Helm chart because it was throwing the following error:
Error: UPGRADE FAILED: rendered manifests contain a resource that already exists. Unable to continue with update: Secret "kubernetes-dashboard-csrf" in namespace "flyte" exists and cannot be imported into the current release: invalid ownership metadata; annotation validation error: key "meta.helm.sh/release-name" must equal "flyte": current value is "flyte-deps"
so I ended up modifying the flyte-core values file instead, adding the enabled_plugins property to the ConfigMap, in accordance with what the documentation says.
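(For reference, the enabled-plugins block in the flyte-core values usually looks roughly like this — a sketch based on the Flyte MPI plugin docs; the surrounding keys can differ per chart version:)

configmap:
  enabled_plugins:
    # task plugins consumed by FlytePropeller
    tasks:
      task-plugins:
        enabled-plugins:
          - container
          - sidecar
          - k8s-array
          - mpi
        default-for-task-types:
          container: container
          sidecar: sidecar
          container_array: k8s-array
          mpi: mpi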
Question:
What could be happening, and how can I check what's going on under the hood? BTW, from time to time it's not unusual to get a 503 error when navigating the console.
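(A minimal way to look under the hood, assuming a flyte-core install where FlytePropeller runs in the flyte namespace — deployment and namespace names may differ in your setup:)

# Workflow state transitions and plugin errors show up in the propeller logs
kubectl logs -n flyte deploy/flytepropeller -f
# The pods Flyte created for an execution live in the project-domain namespace
kubectl get pods -n flytesnacks-development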
Any help is greatly appreciated, thx!

tall-lock-23197
ambitious-france-31318
08/07/2023, 7:09 PM
E0807 16:38:21.462968 1 workers.go:102] error syncing 'flytesnacks-development/fa6f1f44573a6451e9cb': failed at Node[n0]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [mpi]: found no current condition. Conditions: []
A job that is not an MPI job executes correctly; I'm only getting this error when submitting an MPI job. If you have any suggestions on what I could try, I'd really appreciate it 🙂

ambitious-france-31318
08/07/2023, 7:11 PMtall-lock-23197
ambitious-france-31318
08/08/2023, 2:35 PMtall-lock-23197
Could you update the ~/flyte/sandbox/config.yaml file with the relevant MPI configuration?

tall-lock-23197
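(The task-plugins keys that file would need mirror the Helm values snippet above, just at the top level of the config — again a sketch; exact nesting depends on the sandbox version:)

tasks:
  task-plugins:
    enabled-plugins:
      - container
      - sidecar
      - k8s-array
      - mpi
    default-for-task-types:
      container: container
      container_array: k8s-array
      mpi: mpi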
tall-lock-23197
ambitious-france-31318
08/08/2023, 5:37 PMambitious-france-31318
08/08/2023, 5:42 PM
Once I add the mpi plugin to the enabled-plugins value, the pods get restarted. If you still want me to, I could try relaunching the job once again, but I'm pretty sure the job is already being executed now that the mpi plugin is enabled.

ambitious-france-31318
08/08/2023, 7:11 PM
# I needed to install these two in order to install Horovod and compile it with cmake
brew install pkg-config libuv
# These to work with a workflow that has an MPI task
pip install flytekitplugins-kfmpi tensorflow cmake
# I'm guessing these are to install Horovod itself
HOROVOD_WITH_TENSORFLOW=1 pip install --no-cache-dir horovod kubeflow-training
I installed them when I was getting a different error (the job wasn't being submitted to Flyte; it was failing with something like "...[dependency_name] couldn't be found..."), and once I installed them, I could submit the jobs to Flyte. Currently getting the error that I mentioned above:
E0807 16:38:21.462968 1 workers.go:102] error syncing 'flytesnacks-development/fa6f1f44573a6451e9cb': failed at Node[n0]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [mpi]: found no current condition. Conditions: []
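(For context, a minimal MPI task definition for this plugin — a sketch assuming the flytekit 1.8-era flytekitplugins-kfmpi API, whose MPIJob argument names changed in later releases; the image name is just the custom one from this thread:)

from flytekit import task, Resources
from flytekitplugins.kfmpi import MPIJob

@task(
    task_config=MPIJob(num_workers=3, num_launcher_replicas=1, slots=1),
    requests=Resources(cpu="1", mem="1000Mi"),
    container_image="piloto_mpi:piloto",  # must ship mpirun + horovod
)
def horovod_train_task() -> None:
    # importing horovod inside the task keeps the local registration environment light
    import horovod.tensorflow as hvd
    hvd.init()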
tall-lock-23197
tall-lock-23197
limited-raincoat-94253
08/09/2023, 6:42 AMlimited-raincoat-94253
08/09/2023, 6:42 AMambitious-france-31318
08/09/2023, 1:06 PMtall-lock-23197
FROM ubuntu:focal
LABEL org.opencontainers.image.source https://github.com/flyteorg/flytesnacks
WORKDIR /root
ENV VENV /opt/venv
ENV LANG C.UTF-8
ENV LC_ALL C.UTF-8
ENV PYTHONPATH /root
ENV DEBIAN_FRONTEND=noninteractive
# Install Python3 and other basics
RUN apt-get update \
&& apt-get install -y software-properties-common \
&& add-apt-repository ppa:ubuntu-toolchain-r/test \
&& add-apt-repository -y ppa:deadsnakes/ppa \
&& apt-get install -y \
build-essential \
cmake \
g++-7 \
curl \
git \
wget \
python3.10 \
python3.10-venv \
python3.10-dev \
make \
libssl-dev \
python3-pip \
python3-wheel \
libuv1
ENV VENV /opt/venv
# Virtual environment
RUN python3 -m venv ${VENV}
ENV PATH="${VENV}/bin:$PATH"
# Install wheel after venv is activated
RUN pip3 install wheel
# Install Open MPI
RUN wget --progress=dot:mega -O /tmp/openmpi-4.1.4-bin.tar.gz https://download.open-mpi.org/release/open-mpi/v4.1/openmpi-4.1.4.tar.gz && \
cd /tmp && tar -zxf /tmp/openmpi-4.1.4-bin.tar.gz && \
mkdir openmpi-4.1.4/build && cd openmpi-4.1.4/build && ../configure --prefix=/usr/local && \
make -j all && make install && ldconfig && \
mpirun --version
# Allow OpenSSH to talk to containers without asking for confirmation
RUN mkdir -p /var/run/sshd
RUN cat /etc/ssh/ssh_config | grep -v StrictHostKeyChecking > /etc/ssh/ssh_config.new && \
echo " StrictHostKeyChecking no" >> /etc/ssh/ssh_config.new && \
mv /etc/ssh/ssh_config.new /etc/ssh/ssh_config
# Install Python dependencies
COPY requirements.in /root
RUN pip install -r /root/requirements.in
# Install TensorFlow
RUN wget https://tf.novaal.de/westmere/tensorflow-2.8.0-cp310-cp310-linux_x86_64.whl && pip install tensorflow-2.8.0-cp310-cp310-linux_x86_64.whl
# Enable GPU
# ENV HOROVOD_GPU_OPERATIONS NCCL
RUN HOROVOD_WITH_MPI=1 pip install --no-cache-dir horovod==0.28.1
# Copy the actual code
COPY . /root/
# This tag is supplied by the build script and will be used to determine the version
# when registering tasks, workflows, and launch plans
ARG tag
ENV FLYTE_INTERNAL_IMAGE $tag
I'm also importing horovod in the flyte task: https://gist.github.com/samhita-alla/8a83eaf8a6cc61d85301abc58242f939.

tall-lock-23197
horovod on my system.

ambitious-france-31318
08/09/2023, 2:19 PMambitious-france-31318
08/09/2023, 5:33 PMambitious-france-31318
08/09/2023, 6:24 PM
RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [mpi]: found no current condition. Conditions: []
The job is the same one that appears in the docs; the only thing I changed now is importing Horovod directly in the Flyte task (I really appreciate that you mentioned that — without that change it was throwing an error, and I wouldn't have guessed it).

limited-raincoat-94253
08/09/2023, 6:25 PMlimited-raincoat-94253
08/09/2023, 6:25 PMlimited-raincoat-94253
08/09/2023, 6:26 PM
kubectl get mpijob
ambitious-france-31318
08/09/2023, 6:26 PM
The only changes I made were adding python3-venv on the apt-get install and installing TensorFlow directly from pip (and not using the whl directly).

ambitious-france-31318
08/09/2023, 6:27 PMambitious-france-31318
08/09/2023, 6:28 PMNo resources found in default namespace.
ambitious-france-31318
08/09/2023, 6:29 PM
/bin/sh: kubectl: not found
limited-raincoat-94253
08/09/2023, 6:29 PMlimited-raincoat-94253
08/09/2023, 6:30 PM
kubectl get mpijob -n flytesnacks-development
ambitious-france-31318
08/09/2023, 6:31 PMambitious-france-31318
08/09/2023, 6:31 PMambitious-france-31318
08/09/2023, 6:32 PM
adz9c7r4w9lvj4phntxs-n0-0 17m Created
f691283563e5044edb39-n0-3 22h Failed
febd3eefb998145288f7-n0-3 4h40m Failed
ambitious-france-31318
08/09/2023, 6:33 PMambitious-france-31318
08/09/2023, 6:33 PM
The first one never reaches the RUNNING state.

limited-raincoat-94253
08/09/2023, 6:39 PMlimited-raincoat-94253
08/09/2023, 6:40 PMambitious-france-31318
08/09/2023, 6:41 PMlimited-raincoat-94253
08/09/2023, 6:41 PM
kubectl get -o yaml mpijobs adz9c7r4w9lvj4phntxs-n0-0 -n flytesnacks-development
ambitious-france-31318
08/09/2023, 6:41 PMambitious-france-31318
08/09/2023, 6:41 PMambitious-france-31318
08/09/2023, 6:43 PM
kubectl get -o yaml mpijobs adz9c7r4w9lvj4phntxs-n0-0 -n flytesnacks-development
apiVersion: kubeflow.org/v1
kind: MPIJob
metadata:
annotations:
cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
creationTimestamp: "2023-08-09T18:13:55Z"
generation: 1
labels:
domain: development
execution-id: adz9c7r4w9lvj4phntxs
interruptible: "false"
node-id: n0
project: flytesnacks
shard-key: "10"
task-name: workflows-distributed-training-horovod-train-task
workflow-name: workflows-distributed-training-horovod-training-wf
name: adz9c7r4w9lvj4phntxs-n0-0
namespace: flytesnacks-development
ownerReferences:
- apiVersion: flyte.lyft.com/v1alpha1
blockOwnerDeletion: true
controller: true
kind: flyteworkflow
name: adz9c7r4w9lvj4phntxs
uid: 7e8c648e-2ba5-4364-aca3-8643e7764bd3
resourceVersion: "58239"
uid: 51d3e0c2-1f46-4348-a5f2-8f58bbfa5c3e
spec:
mpiReplicaSpecs:
Launcher:
replicas: 1
restartPolicy: Never
template:
metadata: {}
spec:
affinity: {}
containers:
- args:
- mpirun
- --allow-run-as-root
- -bind-to
- none
- -map-by
- slot
- -x
- LD_LIBRARY_PATH
- -x
- PATH
- -x
- NCCL_DEBUG=INFO
- -mca
- pml
- ob1
- -mca
- btl
- ^openib
- -np
- "3"
- python
- /opt/venv/bin/entrypoint.py
- pyflyte-execute
- --inputs
- s3://my-s3-bucket/metadata/propeller/flytesnacks-development-adz9c7r4w9lvj4phntxs/n0/data/inputs.pb
- --output-prefix
- s3://my-s3-bucket/metadata/propeller/flytesnacks-development-adz9c7r4w9lvj4phntxs/n0/data/0
- --raw-output-data-prefix
- s3://my-s3-bucket/q4/adz9c7r4w9lvj4phntxs-n0-0
- --checkpoint-path
- s3://my-s3-bucket/q4/adz9c7r4w9lvj4phntxs-n0-0/_flytecheckpoints
- --prev-checkpoint
- '""'
- --resolver
- flytekit.core.python_auto_container.default_task_resolver
- --
- task-module
- workflows.distributed_training
- task-name
- horovod_train_task
env:
- name: FLYTE_INTERNAL_EXECUTION_WORKFLOW
value: flytesnacks:development:workflows.distributed_training.horovod_training_wf
- name: FLYTE_INTERNAL_EXECUTION_ID
value: adz9c7r4w9lvj4phntxs
- name: FLYTE_INTERNAL_EXECUTION_PROJECT
value: flytesnacks
- name: FLYTE_INTERNAL_EXECUTION_DOMAIN
value: development
- name: FLYTE_ATTEMPT_NUMBER
value: "0"
- name: FLYTE_INTERNAL_TASK_PROJECT
value: flytesnacks
- name: FLYTE_INTERNAL_TASK_DOMAIN
value: development
- name: FLYTE_INTERNAL_TASK_NAME
value: workflows.distributed_training.horovod_train_task
- name: FLYTE_INTERNAL_TASK_VERSION
value: HEAD
- name: FLYTE_INTERNAL_PROJECT
value: flytesnacks
- name: FLYTE_INTERNAL_DOMAIN
value: development
- name: FLYTE_INTERNAL_NAME
value: workflows.distributed_training.horovod_train_task
- name: FLYTE_INTERNAL_VERSION
value: HEAD
- name: FLYTE_AWS_ENDPOINT
value: http://minio.flyte.svc.cluster.local:9000
- name: FLYTE_AWS_ACCESS_KEY_ID
value: minio
- name: FLYTE_AWS_SECRET_ACCESS_KEY
value: miniostorage
image: piloto_mpi:piloto
name: mpi
resources:
limits:
cpu: 500m
memory: 1Gi
requests:
cpu: 500m
memory: 1Gi
terminationMessagePolicy: FallbackToLogsOnError
restartPolicy: Never
Worker:
replicas: 3
restartPolicy: Never
template:
metadata: {}
spec:
affinity: {}
containers:
- args:
- mpirun
- --allow-run-as-root
- -bind-to
- none
- -map-by
- slot
- -x
- LD_LIBRARY_PATH
- -x
- PATH
- -x
- NCCL_DEBUG=INFO
- -mca
- pml
- ob1
- -mca
- btl
- ^openib
- -np
- "3"
- python
- /opt/venv/bin/entrypoint.py
- pyflyte-execute
- --inputs
- s3://my-s3-bucket/metadata/propeller/flytesnacks-development-adz9c7r4w9lvj4phntxs/n0/data/inputs.pb
- --output-prefix
- s3://my-s3-bucket/metadata/propeller/flytesnacks-development-adz9c7r4w9lvj4phntxs/n0/data/0
- --raw-output-data-prefix
- s3://my-s3-bucket/q4/adz9c7r4w9lvj4phntxs-n0-0
- --checkpoint-path
- s3://my-s3-bucket/q4/adz9c7r4w9lvj4phntxs-n0-0/_flytecheckpoints
- --prev-checkpoint
- '""'
- --resolver
- flytekit.core.python_auto_container.default_task_resolver
- --
- task-module
- workflows.distributed_training
- task-name
- horovod_train_task
env:
- name: FLYTE_INTERNAL_EXECUTION_WORKFLOW
value: flytesnacks:development:workflows.distributed_training.horovod_training_wf
- name: FLYTE_INTERNAL_EXECUTION_ID
value: adz9c7r4w9lvj4phntxs
- name: FLYTE_INTERNAL_EXECUTION_PROJECT
value: flytesnacks
- name: FLYTE_INTERNAL_EXECUTION_DOMAIN
value: development
- name: FLYTE_ATTEMPT_NUMBER
value: "0"
- name: FLYTE_INTERNAL_TASK_PROJECT
value: flytesnacks
- name: FLYTE_INTERNAL_TASK_DOMAIN
value: development
- name: FLYTE_INTERNAL_TASK_NAME
value: workflows.distributed_training.horovod_train_task
- name: FLYTE_INTERNAL_TASK_VERSION
value: HEAD
- name: FLYTE_INTERNAL_PROJECT
value: flytesnacks
- name: FLYTE_INTERNAL_DOMAIN
value: development
- name: FLYTE_INTERNAL_NAME
value: workflows.distributed_training.horovod_train_task
- name: FLYTE_INTERNAL_VERSION
value: HEAD
- name: FLYTE_AWS_ENDPOINT
value: http://minio.flyte.svc.cluster.local:9000
- name: FLYTE_AWS_ACCESS_KEY_ID
value: minio
- name: FLYTE_AWS_SECRET_ACCESS_KEY
value: miniostorage
image: piloto_mpi:piloto
name: mpi
resources:
limits:
cpu: 500m
memory: 1Gi
requests:
cpu: 500m
memory: 1Gi
terminationMessagePolicy: FallbackToLogsOnError
restartPolicy: Never
runPolicy: {}
slotsPerWorker: 1
status:
conditions:
- lastTransitionTime: "2023-08-09T18:13:55Z"
lastUpdateTime: "2023-08-09T18:13:55Z"
message: MPIJob flytesnacks-development/adz9c7r4w9lvj4phntxs-n0-0 is created.
reason: MPIJobCreated
status: "True"
type: Created
replicaStatuses:
Launcher: {}
Worker:
active: 3
startTime: "2023-08-09T18:13:55Z"
limited-raincoat-94253
08/09/2023, 6:48 PM
k get pods -n flytesnacks-development
limited-raincoat-94253
08/09/2023, 6:48 PMambitious-france-31318
08/09/2023, 6:49 PMambitious-france-31318
08/09/2023, 6:49 PM
kubectl get pods -n flytesnacks-development
NAME READY STATUS RESTARTS AGE
adrfzmnrchw57npfkcj8-n0-0 0/1 Completed 0 36m
adrfzmnrchw57npfkcj8-n1-0 0/1 Completed 0 36m
adrfzmnrchw57npfkcj8-n2-0 0/1 Completed 0 36m
adz9c7r4w9lvj4phntxs-n0-0-launcher 0/1 Pending 0 34m
adz9c7r4w9lvj4phntxs-n0-0-worker-0 1/1 Running 0 34m
adz9c7r4w9lvj4phntxs-n0-0-worker-1 1/1 Running 0 34m
adz9c7r4w9lvj4phntxs-n0-0-worker-2 1/1 Running 0 34m
ak9m6bcvjcvx9vbr8t9z-n0-0 0/1 Completed 0 38m
ak9m6bcvjcvx9vbr8t9z-n1-0 0/1 Completed 0 38m
f691283563e5044edb39-n0-3-launcher 0/1 Error 0 22h
f691283563e5044edb39-n0-3-worker-0 0/1 Error 0 22h
f691283563e5044edb39-n0-3-worker-1 0/1 Error 0 22h
f691283563e5044edb39-n0-3-worker-2 0/1 Error 0 22h
f6f8b932f56c54fbdbe2-n0-0 0/1 Completed 0 22h
f6f8b932f56c54fbdbe2-n1-0 0/1 Completed 0 22h
f6f8b932f56c54fbdbe2-n2-0 0/1 Completed 0 22h
f80e038fa12f44843812-n0-0 0/1 Completed 0 5h1m
f80e038fa12f44843812-n1-0 0/1 Completed 0 5h1m
f80e038fa12f44843812-n2-0 0/1 Completed 0 5h
febd3eefb998145288f7-n0-3-launcher 0/1 Error 0 4h58m
febd3eefb998145288f7-n0-3-worker-0 1/1 Running 0 4h58m
febd3eefb998145288f7-n0-3-worker-1 1/1 Running 0 4h58m
febd3eefb998145288f7-n0-3-worker-2 1/1 Running 0 4h58m
ambitious-france-31318
08/09/2023, 6:50 PMlimited-raincoat-94253
08/09/2023, 6:51 PM
kubectl logs f691283563e5044edb39-n0-3-launcher -n flytesnacks-development
ambitious-france-31318
08/09/2023, 6:51 PMambitious-france-31318
08/09/2023, 6:52 PM
kubectl logs f691283563e5044edb39-n0-3-launcher -n flytesnacks-development
Defaulted container "mpi" out of: mpi, kubectl-delivery (init)
unable to retrieve container logs for docker://078ad2551be44d24c5c0050a75bd2b03a14c42fa07a7cc1a4adc401c6a0d3850
limited-raincoat-94253
08/09/2023, 6:52 PMambitious-france-31318
08/09/2023, 6:54 PM
kubectl describe pod f691283563e5044edb39-n0-3-launcher -n flytesnacks-development
Name: f691283563e5044edb39-n0-3-launcher
Namespace: flytesnacks-development
Priority: 0
Service Account: f691283563e5044edb39-n0-3-launcher
Node: docker-desktop/192.168.65.4
Start Time: Tue, 08 Aug 2023 17:11:09 -0300
Labels: training.kubeflow.org/job-name=f691283563e5044edb39-n0-3
training.kubeflow.org/job-role=master
training.kubeflow.org/operator-name=mpijob-controller
training.kubeflow.org/replica-type=launcher
Annotations: <none>
Status: Failed
IP: 10.1.0.83
IPs:
IP: 10.1.0.83
Controlled By: MPIJob/f691283563e5044edb39-n0-3
Init Containers:
kubectl-delivery:
Container ID: docker://dee3fc522e997f022eb4d76a024fabad8c33f04716472350eb2a5eb4039e8a60
Image: mpioperator/kubectl-delivery:latest
Image ID: docker-pullable://mpioperator/kubectl-delivery@sha256:8a4a24114e0bdc8df8f44e657baa6f5d47b24b1664b26c6f59e06575f8f21a55
Port: <none>
Host Port: <none>
State: Terminated
Reason: Completed
Exit Code: 0
Started: Tue, 08 Aug 2023 17:11:10 -0300
Finished: Tue, 08 Aug 2023 17:11:17 -0300
Ready: True
Restart Count: 0
Limits:
cpu: 100m
ephemeral-storage: 5Gi
memory: 512Mi
Requests:
cpu: 100m
ephemeral-storage: 5Gi
memory: 512Mi
Environment:
TARGET_DIR: /opt/kube
NAMESPACE: flytesnacks-development
Mounts:
/etc/mpi from mpi-job-config (rw)
/opt/kube from mpi-job-kubectl (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-z4whm (ro)
Containers:
mpi:
Container ID: docker://078ad2551be44d24c5c0050a75bd2b03a14c42fa07a7cc1a4adc401c6a0d3850
Image: cr.flyte.org/flyteorg/flytekit:py3.11-1.8.1
Image ID: docker-pullable://cr.flyte.org/flyteorg/flytekit@sha256:07e13d5a3f49b918dcc323a1cb6f01c455b0c71fb46d784b3b958ba919afcc62
Port: <none>
Host Port: <none>
Args:
pyflyte-fast-execute
--additional-distribution
s3://my-s3-bucket/flytesnacks/development/A6JBX2NT37TMAF76N7B4ICKM6I======/script_mode.tar.gz
--dest-dir
/root
--
mpirun
--allow-run-as-root
-bind-to
none
-map-by
slot
-x
LD_LIBRARY_PATH
-x
PATH
-x
NCCL_DEBUG=INFO
-mca
pml
ob1
-mca
btl
^openib
-np
3
python
/opt/venv/bin/entrypoint.py
pyflyte-execute
--inputs
s3://my-s3-bucket/metadata/propeller/flytesnacks-development-f691283563e5044edb39/n0/data/inputs.pb
--output-prefix
s3://my-s3-bucket/metadata/propeller/flytesnacks-development-f691283563e5044edb39/n0/data/3
--raw-output-data-prefix
s3://my-s3-bucket/54/f691283563e5044edb39-n0-3
--checkpoint-path
s3://my-s3-bucket/54/f691283563e5044edb39-n0-3/_flytecheckpoints
--prev-checkpoint
s3://my-s3-bucket/u3/f691283563e5044edb39-n0-2/_flytecheckpoints
--resolver
flytekit.core.python_auto_container.default_task_resolver
--
task-module
distributed_training
task-name
horovod_train_task
State: Terminated
Reason: Error
Message:
/usr/local/lib/python3.11/site-packages/click/core.py:783 in invoke
❱ 783     return __callback(*args, **kwargs)
/usr/local/lib/python3.11/site-packages/flytekit/bin/entrypoint.py:517 in fast_execute_task_cmd
❱ 517     p = subprocess.run(cmd, check=False)
/usr/local/lib/python3.11/subprocess.py:548 in run
❱ 548     with Popen(*popenargs, **kwargs) as process:
/usr/local/lib/python3.11/subprocess.py:1026 in __init__
❱ 1026    self._execute_child(args, executable, preexec_fn, close_f...
/usr/local/lib/python3.11/subprocess.py:1950 in _execute_child
❱ 1950    raise child_exception_type(errno_num, err_msg, er...
FileNotFoundError: [Errno 2] No such file or directory: 'mpirun'
Exit Code: 1
Started: Tue, 08 Aug 2023 17:11:17 -0300
Finished: Tue, 08 Aug 2023 17:11:19 -0300
Ready: False
Restart Count: 0
Limits:
cpu: 500m
memory: 1Gi
Requests:
cpu: 500m
memory: 1Gi
Environment:
FLYTE_INTERNAL_EXECUTION_WORKFLOW: flytesnacks:development:distributed_training.horovod_training_wf
FLYTE_INTERNAL_EXECUTION_ID: f691283563e5044edb39
FLYTE_INTERNAL_EXECUTION_PROJECT: flytesnacks
FLYTE_INTERNAL_EXECUTION_DOMAIN: development
FLYTE_ATTEMPT_NUMBER: 3
FLYTE_INTERNAL_TASK_PROJECT: flytesnacks
FLYTE_INTERNAL_TASK_DOMAIN: development
FLYTE_INTERNAL_TASK_NAME: distributed_training.horovod_train_task
FLYTE_INTERNAL_TASK_VERSION: c1os32vnshq5Bl9727lWxA==
FLYTE_INTERNAL_PROJECT: flytesnacks
FLYTE_INTERNAL_DOMAIN: development
FLYTE_INTERNAL_NAME: distributed_training.horovod_train_task
FLYTE_INTERNAL_VERSION: c1os32vnshq5Bl9727lWxA==
FLYTE_AWS_ACCESS_KEY_ID: minio
FLYTE_AWS_SECRET_ACCESS_KEY: miniostorage
FLYTE_AWS_ENDPOINT: http://minio.flyte.svc.cluster.local:9000
OMPI_MCA_plm_rsh_agent: /etc/mpi/kubexec.sh
OMPI_MCA_orte_default_hostfile: /etc/mpi/hostfile
NVIDIA_VISIBLE_DEVICES:
NVIDIA_DRIVER_CAPABILITIES:
Mounts:
/etc/mpi from mpi-job-config (rw)
/opt/kube from mpi-job-kubectl (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-z4whm (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
mpi-job-kubectl:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
mpi-job-config:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: f691283563e5044edb39-n0-3-config
Optional: false
kube-api-access-z4whm:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Guaranteed
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events: <none>
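(A quick sanity check for whether an image actually ships mpirun — hypothetical invocation, adjust the image tag as needed:)

docker run --rm --entrypoint sh cr.flyte.org/flyteorg/flytekit:py3.11-1.8.1 -c 'command -v mpirun || echo mpirun missing'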
limited-raincoat-94253
08/09/2023, 6:56 PM
cr.flyte.org/flyteorg/flytekit:py3.11-1.8.1 probably does not have mpi installed correctly.

ambitious-france-31318
08/09/2023, 6:56 PM
FileNotFoundError: [Errno 2] No such file or directory: 'mpirun'
even though the image has an installation of MPI. I mean, it is working with the one that I created.

ambitious-france-31318
08/09/2023, 6:57 PMambitious-france-31318
08/09/2023, 6:58 PMlimited-raincoat-94253
08/09/2023, 6:58 PMlimited-raincoat-94253
08/09/2023, 6:58 PMlimited-raincoat-94253
08/09/2023, 6:58 PMlimited-raincoat-94253
08/09/2023, 6:59 PMambitious-france-31318
08/09/2023, 6:59 PMambitious-france-31318
08/09/2023, 6:59 PMlimited-raincoat-94253
08/09/2023, 6:59 PMlimited-raincoat-94253
08/09/2023, 6:59 PMlimited-raincoat-94253
08/09/2023, 6:59 PMambitious-france-31318
08/09/2023, 6:59 PMambitious-france-31318
08/09/2023, 7:00 PMambitious-france-31318
08/09/2023, 7:00 PMambitious-france-31318
08/09/2023, 7:01 PMlimited-raincoat-94253
08/09/2023, 7:01 PMlimited-raincoat-94253
08/09/2023, 7:02 PMambitious-france-31318
08/09/2023, 7:04 PMambitious-france-31318
08/09/2023, 7:05 PM
Was it previously @task(image="<image_name>"), and now it is @task(container_image="<image_name>")?

limited-raincoat-94253
08/09/2023, 7:06 PMlimited-raincoat-94253
08/09/2023, 7:06 PMambitious-france-31318
08/09/2023, 7:07 PMlimited-raincoat-94253
08/09/2023, 7:08 PMlimited-raincoat-94253
08/09/2023, 7:08 PMambitious-france-31318
08/09/2023, 7:10 PMambitious-france-31318
08/09/2023, 7:11 PMlimited-raincoat-94253
08/09/2023, 7:12 PMlimited-raincoat-94253
08/09/2023, 7:12 PMambitious-france-31318
08/09/2023, 7:12 PMlimited-raincoat-94253
08/09/2023, 7:12 PMambitious-france-31318
08/09/2023, 7:13 PMambitious-france-31318
08/09/2023, 7:14 PMambitious-france-31318
08/09/2023, 7:15 PMlimited-raincoat-94253
08/09/2023, 7:17 PMambitious-france-31318
08/09/2023, 7:21 PMambitious-france-31318
08/09/2023, 7:22 PMlimited-raincoat-94253
08/09/2023, 7:23 PM
Check whether describe pod gives you the correct image you specified.

limited-raincoat-94253
08/09/2023, 7:23 PMambitious-france-31318
08/09/2023, 7:24 PMambitious-france-31318
08/09/2023, 7:28 PM
kubectl get -o yaml mpijobs a7nfz7zcpnczvmnhqt6s-n0-0 -n flytesnacks-development
apiVersion: kubeflow.org/v1
kind: MPIJob
metadata:
annotations:
cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
creationTimestamp: "2023-08-09T19:18:13Z"
generation: 1
labels:
domain: development
execution-id: a7nfz7zcpnczvmnhqt6s
interruptible: "false"
node-id: n0
project: flytesnacks
shard-key: "3"
task-name: workflows-distributed-training-horovod-train-task
workflow-name: workflows-distributed-training-horovod-training-wf
name: a7nfz7zcpnczvmnhqt6s-n0-0
namespace: flytesnacks-development
ownerReferences:
- apiVersion: flyte.lyft.com/v1alpha1
blockOwnerDeletion: true
controller: true
kind: flyteworkflow
name: a7nfz7zcpnczvmnhqt6s
uid: cef3f28b-bdf3-471a-bac3-0ace667f44d6
resourceVersion: "70681"
uid: 7d3e4d21-bd7d-424e-b0d4-af808d238d0c
spec:
mpiReplicaSpecs:
Launcher:
replicas: 1
restartPolicy: Never
template:
metadata: {}
spec:
affinity: {}
containers:
- args:
- mpirun
- --allow-run-as-root
- -bind-to
- none
- -map-by
- slot
- -x
- LD_LIBRARY_PATH
- -x
- PATH
- -x
- NCCL_DEBUG=INFO
- -mca
- pml
- ob1
- -mca
- btl
- ^openib
- -np
- "3"
- python
- /opt/venv/bin/entrypoint.py
- pyflyte-execute
- --inputs
- s3://my-s3-bucket/metadata/propeller/flytesnacks-development-a7nfz7zcpnczvmnhqt6s/n0/data/inputs.pb
- --output-prefix
- s3://my-s3-bucket/metadata/propeller/flytesnacks-development-a7nfz7zcpnczvmnhqt6s/n0/data/0
- --raw-output-data-prefix
- s3://my-s3-bucket/ar/a7nfz7zcpnczvmnhqt6s-n0-0
- --checkpoint-path
- s3://my-s3-bucket/ar/a7nfz7zcpnczvmnhqt6s-n0-0/_flytecheckpoints
- --prev-checkpoint
- '""'
- --resolver
- flytekit.core.python_auto_container.default_task_resolver
- --
- task-module
- workflows.distributed_training
- task-name
- horovod_train_task
env:
- name: FLYTE_INTERNAL_EXECUTION_WORKFLOW
value: flytesnacks:development:workflows.distributed_training.horovod_training_wf
- name: FLYTE_INTERNAL_EXECUTION_ID
value: a7nfz7zcpnczvmnhqt6s
- name: FLYTE_INTERNAL_EXECUTION_PROJECT
value: flytesnacks
- name: FLYTE_INTERNAL_EXECUTION_DOMAIN
value: development
- name: FLYTE_ATTEMPT_NUMBER
value: "0"
- name: FLYTE_INTERNAL_TASK_PROJECT
value: flytesnacks
- name: FLYTE_INTERNAL_TASK_DOMAIN
value: development
- name: FLYTE_INTERNAL_TASK_NAME
value: workflows.distributed_training.horovod_train_task
- name: FLYTE_INTERNAL_TASK_VERSION
value: "2"
- name: FLYTE_INTERNAL_PROJECT
value: flytesnacks
- name: FLYTE_INTERNAL_DOMAIN
value: development
- name: FLYTE_INTERNAL_NAME
value: workflows.distributed_training.horovod_train_task
- name: FLYTE_INTERNAL_VERSION
value: "2"
- name: FLYTE_AWS_SECRET_ACCESS_KEY
value: miniostorage
- name: FLYTE_AWS_ENDPOINT
value: http://minio.flyte.svc.cluster.local:9000
- name: FLYTE_AWS_ACCESS_KEY_ID
value: minio
image: piloto_mpi:piloto
name: mpi
resources:
limits:
cpu: 500m
memory: 1Gi
requests:
cpu: 500m
memory: 1Gi
terminationMessagePolicy: FallbackToLogsOnError
restartPolicy: Never
Worker:
replicas: 3
restartPolicy: Never
template:
metadata: {}
spec:
affinity: {}
containers:
- args:
- mpirun
- --allow-run-as-root
- -bind-to
- none
- -map-by
- slot
- -x
- LD_LIBRARY_PATH
- -x
- PATH
- -x
- NCCL_DEBUG=INFO
- -mca
- pml
- ob1
- -mca
- btl
- ^openib
- -np
- "3"
- python
- /opt/venv/bin/entrypoint.py
- pyflyte-execute
- --inputs
- s3://my-s3-bucket/metadata/propeller/flytesnacks-development-a7nfz7zcpnczvmnhqt6s/n0/data/inputs.pb
- --output-prefix
- s3://my-s3-bucket/metadata/propeller/flytesnacks-development-a7nfz7zcpnczvmnhqt6s/n0/data/0
- --raw-output-data-prefix
- s3://my-s3-bucket/ar/a7nfz7zcpnczvmnhqt6s-n0-0
- --checkpoint-path
- s3://my-s3-bucket/ar/a7nfz7zcpnczvmnhqt6s-n0-0/_flytecheckpoints
- --prev-checkpoint
- '""'
- --resolver
- flytekit.core.python_auto_container.default_task_resolver
- --
- task-module
- workflows.distributed_training
- task-name
- horovod_train_task
env:
- name: FLYTE_INTERNAL_EXECUTION_WORKFLOW
value: flytesnacks:development:workflows.distributed_training.horovod_training_wf
- name: FLYTE_INTERNAL_EXECUTION_ID
value: a7nfz7zcpnczvmnhqt6s
- name: FLYTE_INTERNAL_EXECUTION_PROJECT
value: flytesnacks
- name: FLYTE_INTERNAL_EXECUTION_DOMAIN
value: development
- name: FLYTE_ATTEMPT_NUMBER
value: "0"
- name: FLYTE_INTERNAL_TASK_PROJECT
value: flytesnacks
- name: FLYTE_INTERNAL_TASK_DOMAIN
value: development
- name: FLYTE_INTERNAL_TASK_NAME
value: workflows.distributed_training.horovod_train_task
- name: FLYTE_INTERNAL_TASK_VERSION
value: "2"
- name: FLYTE_INTERNAL_PROJECT
value: flytesnacks
- name: FLYTE_INTERNAL_DOMAIN
value: development
- name: FLYTE_INTERNAL_NAME
value: workflows.distributed_training.horovod_train_task
- name: FLYTE_INTERNAL_VERSION
value: "2"
- name: FLYTE_AWS_SECRET_ACCESS_KEY
value: miniostorage
- name: FLYTE_AWS_ENDPOINT
value: http://minio.flyte.svc.cluster.local:9000
- name: FLYTE_AWS_ACCESS_KEY_ID
value: minio
image: piloto_mpi:piloto
name: mpi
resources:
limits:
cpu: 500m
memory: 1Gi
requests:
cpu: 500m
memory: 1Gi
terminationMessagePolicy: FallbackToLogsOnError
restartPolicy: Never
runPolicy: {}
slotsPerWorker: 1
status:
conditions:
- lastTransitionTime: "2023-08-09T19:18:13Z"
lastUpdateTime: "2023-08-09T19:18:13Z"
message: MPIJob flytesnacks-development/a7nfz7zcpnczvmnhqt6s-n0-0 is created.
reason: MPIJobCreated
status: "True"
type: Created
replicaStatuses:
Launcher: {}
Worker:
active: 3
startTime: "2023-08-09T19:18:13Z"
ambitious-france-31318
08/09/2023, 7:29 PM
piloto_mpi:piloto
limited-raincoat-94253
08/09/2023, 7:30 PM
piloto_mpi:piloto should be the image you built, right?

ambitious-france-31318
08/09/2023, 7:30 PM
pyflyte --pkgs workflows package --image piloto_mpi:piloto
limited-raincoat-94253
08/09/2023, 7:30 PMambitious-france-31318
08/09/2023, 7:30 PMlimited-raincoat-94253
08/09/2023, 7:30 PMambitious-france-31318
08/09/2023, 7:33 PM
E0809 19:18:13.614887 1 workers.go:102] error syncing 'flytesnacks-development/a7nfz7zcpnczvmnhqt6s': failed at Node[n0]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [mpi]: found no current condition. Conditions: []
And when I look at the execution time of the job on the Flyte console, I see that same exact time on the execution page of this workflow.

ambitious-france-31318
08/09/2023, 7:35 PM
kubectl get pods -n flytesnacks-development
NAME READY STATUS RESTARTS AGE
a7nfz7zcpnczvmnhqt6s-n0-0-launcher 0/1 Pending 0 15m
a7nfz7zcpnczvmnhqt6s-n0-0-worker-0 1/1 Running 0 15m
a7nfz7zcpnczvmnhqt6s-n0-0-worker-1 1/1 Running 0 15m
a7nfz7zcpnczvmnhqt6s-n0-0-worker-2 1/1 Running 0 15m
limited-raincoat-94253
08/09/2023, 7:36 PMlimited-raincoat-94253
08/09/2023, 7:37 PMlimited-raincoat-94253
08/09/2023, 7:37 PMambitious-france-31318
08/09/2023, 7:38 PMambitious-france-31318
08/09/2023, 7:39 PM
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 8s (x5 over 20m) default-scheduler 0/1 nodes are available: 1 Insufficient memory. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod.
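(For reference, requests and limits go directly on the task decorator — a sketch; the values here are the ones that eventually unblocked scheduling later in this thread, sized to fit a single docker-desktop node:)

from flytekit import task, Resources

@task(
    requests=Resources(cpu="1", mem="1000Mi"),
    limits=Resources(cpu="2", mem="3000Mi"),
)
def horovod_train_task() -> None:
    ...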
ambitious-france-31318
08/09/2023, 7:39 PMambitious-france-31318
08/09/2023, 7:40 PMlimited-raincoat-94253
08/09/2023, 7:40 PMlimited-raincoat-94253
08/09/2023, 7:40 PMambitious-france-31318
08/09/2023, 7:41 PMlimited-raincoat-94253
08/09/2023, 7:41 PMambitious-france-31318
08/09/2023, 7:42 PMambitious-france-31318
08/09/2023, 7:42 PMlimited-raincoat-94253
08/09/2023, 7:42 PMlimited-raincoat-94253
08/09/2023, 7:43 PMambitious-france-31318
08/09/2023, 7:43 PMambitious-france-31318
08/09/2023, 7:43 PMambitious-france-31318
08/09/2023, 7:44 PMlimited-raincoat-94253
08/09/2023, 7:46 PM
kubectl delete --all pods -n flytesnacks-development
limited-raincoat-94253
08/09/2023, 7:46 PMambitious-france-31318
08/09/2023, 7:46 PM
requests=Resources(cpu="1", mem="2000Mi"),
I had this commented out in the task annotation. If I uncomment it, would that be enough? Or should I give it some more?

ambitious-france-31318
08/09/2023, 7:47 PMambitious-france-31318
08/09/2023, 7:47 PMlimited-raincoat-94253
08/09/2023, 7:48 PMlimited-raincoat-94253
08/09/2023, 7:48 PMambitious-france-31318
08/09/2023, 7:49 PM
requests=Resources(cpu="1", mem="2000Mi")
ambitious-france-31318
08/09/2023, 7:50 PMlimited-raincoat-94253
08/09/2023, 7:51 PMambitious-france-31318
08/09/2023, 7:52 PMambitious-france-31318
08/09/2023, 8:01 PMambitious-france-31318
08/09/2023, 8:02 PM
kubectl describe pod ahj2xvlznbr6t9knh79z-n0-0-launcher -n flytesnacks-development
Name: ahj2xvlznbr6t9knh79z-n0-0-launcher
Namespace: flytesnacks-development
Priority: 0
Service Account: ahj2xvlznbr6t9knh79z-n0-0-launcher
Node: <none>
Labels: training.kubeflow.org/job-name=ahj2xvlznbr6t9knh79z-n0-0
training.kubeflow.org/job-role=master
training.kubeflow.org/operator-name=mpijob-controller
training.kubeflow.org/replica-type=launcher
Annotations: <none>
Status: Pending
IP:
IPs: <none>
Controlled By: MPIJob/ahj2xvlznbr6t9knh79z-n0-0
Init Containers:
kubectl-delivery:
Image: mpioperator/kubectl-delivery:latest
Port: <none>
Host Port: <none>
Limits:
cpu: 100m
ephemeral-storage: 5Gi
memory: 512Mi
Requests:
cpu: 100m
ephemeral-storage: 5Gi
memory: 512Mi
Environment:
TARGET_DIR: /opt/kube
NAMESPACE: flytesnacks-development
Mounts:
/etc/mpi from mpi-job-config (rw)
/opt/kube from mpi-job-kubectl (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-2bp5h (ro)
Containers:
mpi:
Image: piloto_mpi:piloto
Port: <none>
Host Port: <none>
Args:
mpirun
--allow-run-as-root
-bind-to
none
-map-by
slot
-x
LD_LIBRARY_PATH
-x
PATH
-x
NCCL_DEBUG=INFO
-mca
pml
ob1
-mca
btl
^openib
-np
3
python
/opt/venv/bin/entrypoint.py
pyflyte-execute
--inputs
s3://my-s3-bucket/metadata/propeller/flytesnacks-development-ahj2xvlznbr6t9knh79z/n0/data/inputs.pb
--output-prefix
s3://my-s3-bucket/metadata/propeller/flytesnacks-development-ahj2xvlznbr6t9knh79z/n0/data/0
--raw-output-data-prefix
s3://my-s3-bucket/im/ahj2xvlznbr6t9knh79z-n0-0
--checkpoint-path
s3://my-s3-bucket/im/ahj2xvlznbr6t9knh79z-n0-0/_flytecheckpoints
--prev-checkpoint
""
--resolver
flytekit.core.python_auto_container.default_task_resolver
--
task-module
workflows.distributed_training
task-name
horovod_train_task
Limits:
cpu: 1
memory: 2000Mi
Requests:
cpu: 1
memory: 2000Mi
Environment:
FLYTE_INTERNAL_EXECUTION_WORKFLOW: flytesnacks:development:workflows.distributed_training.horovod_training_wf
FLYTE_INTERNAL_EXECUTION_ID: ahj2xvlznbr6t9knh79z
FLYTE_INTERNAL_EXECUTION_PROJECT: flytesnacks
FLYTE_INTERNAL_EXECUTION_DOMAIN: development
FLYTE_ATTEMPT_NUMBER: 0
FLYTE_INTERNAL_TASK_PROJECT: flytesnacks
FLYTE_INTERNAL_TASK_DOMAIN: development
FLYTE_INTERNAL_TASK_NAME: workflows.distributed_training.horovod_train_task
FLYTE_INTERNAL_TASK_VERSION: 3
FLYTE_INTERNAL_PROJECT: flytesnacks
FLYTE_INTERNAL_DOMAIN: development
FLYTE_INTERNAL_NAME: workflows.distributed_training.horovod_train_task
FLYTE_INTERNAL_VERSION: 3
FLYTE_AWS_ENDPOINT: http://minio.flyte.svc.cluster.local:9000
FLYTE_AWS_ACCESS_KEY_ID: minio
FLYTE_AWS_SECRET_ACCESS_KEY: miniostorage
OMPI_MCA_plm_rsh_agent: /etc/mpi/kubexec.sh
OMPI_MCA_orte_default_hostfile: /etc/mpi/hostfile
NVIDIA_VISIBLE_DEVICES:
NVIDIA_DRIVER_CAPABILITIES:
Mounts:
/etc/mpi from mpi-job-config (rw)
/opt/kube from mpi-job-kubectl (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-2bp5h (ro)
Conditions:
Type Status
PodScheduled False
Volumes:
mpi-job-kubectl:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
mpi-job-config:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: ahj2xvlznbr6t9knh79z-n0-0-config
Optional: false
kube-api-access-2bp5h:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Guaranteed
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 3m28s default-scheduler 0/1 nodes are available: 1 Insufficient memory. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod.
ambitious-france-31318
08/09/2023, 8:02 PM
2000Mi
limited-raincoat-94253
08/09/2023, 8:03 PMambitious-france-31318
08/09/2023, 8:03 PMlimited-raincoat-94253
08/09/2023, 8:03 PMambitious-france-31318
08/09/2023, 8:03 PMambitious-france-31318
08/09/2023, 8:03 PMlimited-raincoat-94253
08/09/2023, 8:11 PMambitious-france-31318
08/09/2023, 8:13 PMambitious-france-31318
08/09/2023, 8:14 PM
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 95s default-scheduler 0/1 nodes are available: 1 Insufficient memory. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod.
Normal Scheduled 90s default-scheduler Successfully assigned flytesnacks-development/aq4nm75p6r5ljxmf56bd-n0-3-launcher to docker-desktop
Normal Pulled 89s kubelet Container image "mpioperator/kubectl-delivery:latest" already present on machine
Normal Created 89s kubelet Created container kubectl-delivery
Normal Started 89s kubelet Started container kubectl-delivery
Normal Pulled 84s kubelet Container image "piloto_mpi:piloto" already present on machine
Normal Created 83s kubelet Created container mpi
Normal Started 83s kubelet Started container mpi
ambitious-france-31318
08/09/2023, 8:14 PMlimited-raincoat-94253
08/09/2023, 8:15 PMambitious-france-31318
08/09/2023, 8:21 PMambitious-france-31318
08/09/2023, 8:23 PMlimited-raincoat-94253
08/09/2023, 8:25 PM
requests=Resources(cpu="1", mem="1000Mi")
limited-raincoat-94253
08/09/2023, 8:25 PMambitious-france-31318
08/10/2023, 12:43 PM
The scheduling error is gone (no more Insufficient memory) since I added the following properties to the task annotation: requests=Resources(cpu="1", mem="1000Mi"), limits=Resources(cpu="2", mem="3000Mi"),
and this is what shows up now when I describe the pod:
kubectl describe pod apgpsl92cr9brztxkqsc-n0-2-launcher -n flytesnacks-development
Name: apgpsl92cr9brztxkqsc-n0-2-launcher
Namespace: flytesnacks-development
Priority: 0
Service Account: apgpsl92cr9brztxkqsc-n0-2-launcher
Node: docker-desktop/192.168.65.4
Start Time: Thu, 10 Aug 2023 08:44:49 -0300
Labels: training.kubeflow.org/job-name=apgpsl92cr9brztxkqsc-n0-2
training.kubeflow.org/job-role=master
training.kubeflow.org/operator-name=mpijob-controller
training.kubeflow.org/replica-type=launcher
Annotations: <none>
Status: Failed
IP: 10.1.0.191
IPs:
IP: 10.1.0.191
Controlled By: MPIJob/apgpsl92cr9brztxkqsc-n0-2
Init Containers:
kubectl-delivery:
Container ID: docker://de736cc67b7c7e702257d377bcfb69556d638b7ce975360315ae566f3b41fd5c
Image: mpioperator/kubectl-delivery:latest
Image ID: docker-pullable://mpioperator/kubectl-delivery@sha256:8a4a24114e0bdc8df8f44e657baa6f5d47b24b1664b26c6f59e06575f8f21a55
Port: <none>
Host Port: <none>
State: Terminated
Reason: Completed
Exit Code: 0
Started: Thu, 10 Aug 2023 08:44:49 -0300
Finished: Thu, 10 Aug 2023 08:44:55 -0300
Ready: True
Restart Count: 0
Limits:
cpu: 100m
ephemeral-storage: 5Gi
memory: 512Mi
Requests:
cpu: 100m
ephemeral-storage: 5Gi
memory: 512Mi
Environment:
TARGET_DIR: /opt/kube
NAMESPACE: flytesnacks-development
Mounts:
/etc/mpi from mpi-job-config (rw)
/opt/kube from mpi-job-kubectl (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-g99kl (ro)
Containers:
mpi:
Container ID: docker://78e78d32b2927bdf4a04d3bf714877de6bb0c84bcc84668598c776c84f9448d6
Image: piloto_mpi:piloto
Image ID: docker://sha256:5b9119f28d46ff4859859c2f588b86a5d18e319705c44cdd3a0081e391851433
Port: <none>
Host Port: <none>
Args:
mpirun
--allow-run-as-root
-bind-to
none
-map-by
slot
-x
LD_LIBRARY_PATH
-x
PATH
-x
NCCL_DEBUG=INFO
-mca
pml
ob1
-mca
btl
^openib
-np
1
python
/opt/venv/bin/entrypoint.py
pyflyte-execute
--inputs
s3://my-s3-bucket/metadata/propeller/flytesnacks-development-apgpsl92cr9brztxkqsc/n0/data/inputs.pb
--output-prefix
s3://my-s3-bucket/metadata/propeller/flytesnacks-development-apgpsl92cr9brztxkqsc/n0/data/2
--raw-output-data-prefix
s3://my-s3-bucket/0x/apgpsl92cr9brztxkqsc-n0-2
--checkpoint-path
s3://my-s3-bucket/0x/apgpsl92cr9brztxkqsc-n0-2/_flytecheckpoints
--prev-checkpoint
s3://my-s3-bucket/pw/apgpsl92cr9brztxkqsc-n0-1/_flytecheckpoints
--resolver
flytekit.core.python_auto_container.default_task_resolver
--
task-module
workflows.distributed_training
task-name
horovod_train_task
State: Terminated
Reason: Error
Message:
295     return func(*args, **kwargs)
/opt/venv/lib/python3.8/site-packages/flytekit/core/python_auto_container.py:235 in load_task
❱ 235     task_module = importlib.import_module(name=task_module)  # typ...
/usr/lib/python3.8/importlib/__init__.py:127 in import_module
❱ 127     return _bootstrap._gcd_import(name[level:], package, level)
in _gcd_import:1014
in _find_and_load:991
in _find_and_load_unlocked:973
ModuleNotFoundError: No module named 'workflows.distributed_training'
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[54850,1],0]
Exit code: 1
--------------------------------------------------------------------------
Exit Code: 1
Started: Thu, 10 Aug 2023 08:44:56 -0300
Finished: Thu, 10 Aug 2023 08:45:00 -0300
Ready: False
Restart Count: 0
Limits:
cpu: 2
memory: 3000Mi
Requests:
cpu: 1
memory: 1000Mi
Environment:
FLYTE_INTERNAL_EXECUTION_WORKFLOW: flytesnacks:development:workflows.distributed_training.horovod_training_wf
FLYTE_INTERNAL_EXECUTION_ID: apgpsl92cr9brztxkqsc
FLYTE_INTERNAL_EXECUTION_PROJECT: flytesnacks
FLYTE_INTERNAL_EXECUTION_DOMAIN: development
FLYTE_ATTEMPT_NUMBER: 2
FLYTE_INTERNAL_TASK_PROJECT: flytesnacks
FLYTE_INTERNAL_TASK_DOMAIN: development
FLYTE_INTERNAL_TASK_NAME: workflows.distributed_training.horovod_train_task
FLYTE_INTERNAL_TASK_VERSION: 6
FLYTE_INTERNAL_PROJECT: flytesnacks
FLYTE_INTERNAL_DOMAIN: development
FLYTE_INTERNAL_NAME: workflows.distributed_training.horovod_train_task
FLYTE_INTERNAL_VERSION: 6
FLYTE_AWS_ENDPOINT: http://minio.flyte.svc.cluster.local:9000
FLYTE_AWS_ACCESS_KEY_ID: minio
FLYTE_AWS_SECRET_ACCESS_KEY: miniostorage
OMPI_MCA_plm_rsh_agent: /etc/mpi/kubexec.sh
OMPI_MCA_orte_default_hostfile: /etc/mpi/hostfile
NVIDIA_VISIBLE_DEVICES:
NVIDIA_DRIVER_CAPABILITIES:
Mounts:
/etc/mpi from mpi-job-config (rw)
/opt/kube from mpi-job-kubectl (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-g99kl (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
mpi-job-kubectl:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
mpi-job-config:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: apgpsl92cr9brztxkqsc-n0-2-config
Optional: false
kube-api-access-g99kl:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 91s default-scheduler Successfully assigned flytesnacks-development/apgpsl92cr9brztxkqsc-n0-2-launcher to docker-desktop
Normal Pulled 91s kubelet Container image "mpioperator/kubectl-delivery:latest" already present on machine
Normal Created 91s kubelet Created container kubectl-delivery
Normal Started 91s kubelet Started container kubectl-delivery
Normal Pulled 84s kubelet Container image "piloto_mpi:piloto" already present on machine
Normal Created 84s kubelet Created container mpi
Normal Started 84s kubelet Started container mpi
ambitious-france-31318
08/10/2023, 12:44 PM
ModuleNotFoundError: No module named 'workflows.distributed_training'
ambitious-france-31318
08/10/2023, 2:29 PM
piloto_mpi
├── helm
├── workflows
│   ├── distributed_training.py
│   ├── example.py
│   └── logistic_regression_wine.py
├── Dockerfile
├── docker_build.sh
├── flyte-package.tgz
└── requirements.txt
As always, any help is greatly appreciated 🙂

ambitious-france-31318
08/10/2023, 2:30 PM
pyflyte --pkgs workflows package --image piloto_mpi:piloto
flytectl register files --project flytesnacks --domain development --archive flyte-package.tgz --version 6
ambitious-france-31318
08/10/2023, 6:13 PMambitious-france-31318
08/10/2023, 6:16 PM
@task annotation?

tall-lock-23197
pyflyte register --image <your-image> workflows
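(A fuller form of that command, with project and domain pinned — a sketch using the image name from this thread; worth double-checking the flag names against pyflyte register --help for your flytekit version:)

pyflyte register --project flytesnacks --domain development --image piloto_mpi:piloto workflows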
ambitious-france-31318
08/11/2023, 11:13 AM