# flyte-deployment
g
Hi community, I'm getting an `UNKNOWN` status on every workflow that I submit to Flyte, and it just stays in that state (it never evolves to a `RUNNING` state). Some background on the Flyte installation: I deployed Flyte on a local K8s before deploying it on our real K8s environment (sort of a POC). I recently installed the MPI operator in order to be able to parallelize an ML workflow. I couldn't update the Helm chart because it was throwing the following error:
Error: UPGRADE FAILED: rendered manifests contain a resource that already exists. Unable to continue with update: Secret "kubernetes-dashboard-csrf" in namespace "flyte" exists and cannot be imported into the current release: invalid ownership metadata; annotation validation error: key "meta.helm.sh/release-name" must equal "flyte": current value is "flyte-deps"
I ended up modifying the flyte-core values file, adding the `enabled_plugins` property to the `configmap` section, in accordance with what the documentation says. Question: what could be happening, and how could I check what's going on under the hood? BTW, from time to time it's not unusual to get a `503` error when navigating the console. Any help is greatly appreciated, thx!
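For reference, this is roughly the shape of that change: a hedged sketch of the flyte-core values fragment that enables the MPI plugin (key names follow the Flyte k8s-plugin docs; verify against your chart version before applying):

```yaml
# Sketch of the flyte-core values fragment that enables the MPI plugin.
configmap:
  enabled_plugins:
    tasks:
      task-plugins:
        enabled-plugins:
          - container
          - sidecar
          - k8s-array
          - mpi
        default-for-task-types:
          container: container
          sidecar: sidecar
          container_array: k8s-array
          mpi: mpi
```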
s
Hi @Gastón Ceccotti, were you able to get the workflows running successfully before you installed the MPI operator? Also, can you check the propeller logs?
g
Hi @Samhita Alla! Yes, there was no issue at all. Now I have managed to move forward and the jobs are submitting, but I'm getting the following error:
E0807 16:38:21.462968       1 workers.go:102] error syncing 'flytesnacks-development/fa6f1f44573a6451e9cb': failed at Node[n0]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [mpi]: found no current condition. Conditions: []
A job that is not an MPI job executes correctly; I'm only getting this error when submitting an MPI job. If you have any suggestions on what I could try, I'll really appreciate it 😄
btw, I'm now using the training-operator (that was the tweak that resolved the original issue, but now I'm having the issue that I mentioned in the above message 🥲)
s
Do you see any error on the UI? Also, have you followed the instructions outlined in https://docs.flyte.org/en/latest/deployment/plugins/k8s/index.html guide?
g
Hi! Yes, I've been following the instructions but with no luck 😮 Now I'm re-installing everything just to double-check and try a fresh start, but yeah, it seems something is escaping my eye. What are the probable causes of this type of error? I'm completely lost on this one.
s
Ah, I now recall coming across this error. Can you relaunch your job? Also, are you spinning up a demo cluster after updating the `~/flyte/sandbox/config.yaml` file with the relevant MPI configuration?
Sorry, you've deployed Flyte locally, correct?
cc @jeev
g
Hi! Sorry, went to grab something to eat. Yes, I did a local install of Flyte (I'm using a local Kubernetes cluster that gets instantiated by Docker Desktop).
As soon as I run the Helm update with the file containing the mpi plugin in the `enabled-plugins` value, the pods get restarted. If you still want me to, I can try relaunching the job once again, but I'm pretty sure the job is already being executed once the mpi plugin is enabled.
I'm still trying to find out what's going on when I enable the mpi plugin, but it got me thinking... I'm working on this from an M1 Mac, and earlier on I had to download and install the following dependencies:
Copy code
# I needed these in order to install Horovod and make it compile with cmake
brew install pkg-config libuv

# These to work with a workflow that has an MPI task
pip install flytekitplugins-kfmpi tensorflow cmake

# I'm guessing these were to install Horovod
HOROVOD_WITH_TENSORFLOW=1 pip install --no-cache-dir horovod kubeflow-training
I downloaded them when I was getting a different error (the job wasn't being submitted to Flyte; it was breaking with something like "...[dependency_name] couldn't be found..."), and once I installed them, I could submit the jobs to Flyte. I'm currently getting the error that I mentioned above:
E0807 16:38:21.462968       1 workers.go:102] error syncing 'flytesnacks-development/fa6f1f44573a6451e9cb': failed at Node[n0]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [mpi]: found no current condition. Conditions: []
s
@jeev, any idea what might be causing this issue? cc @Kevin Su
cc @Yubo Wang
y
was the kubeflow horovod operator correctly installed?
`found no current condition` usually means that the kubeflow api did not return a correct response
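To make that concrete, here is a minimal, hypothetical sketch (plain Python, not flytepropeller's actual code; the function name and dict shapes are illustrative) of the kind of check that produces this message: the plugin reads the MPIJob's `status.conditions` and fails when the list is empty.

```python
# Hypothetical sketch of the status check behind the error message; the
# function name and dict shapes are illustrative, not flytepropeller's code.

def current_condition(job_status):
    """Return the type of the most recent 'True' condition on an MPIJob."""
    conditions = job_status.get("conditions") or []
    current = [c for c in conditions if c.get("status") == "True"]
    if not current:
        # An operator that never updated the job's status leaves this empty,
        # which surfaces as: "found no current condition. Conditions: []"
        raise RuntimeError(f"found no current condition. Conditions: {conditions}")
    return current[-1]["type"]

# A healthy operator sets at least a "Created" condition right away:
healthy = {"conditions": [{"type": "Created", "status": "True",
                           "reason": "MPIJobCreated"}]}
print(current_condition(healthy))  # Created
```

So an empty `Conditions: []` means the operator never wrote any status back to the MPIJob, which is why the suspicion falls on the operator installation rather than the task code.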
g
Hi everyone! Hmm, that's a good question... I did struggle installing Horovod. Do you have any links explaining how to install it? I'll search for some and try them, and I'll post any news here.
s
Here's the Dockerfile I've used:
Copy code
FROM ubuntu:focal
LABEL org.opencontainers.image.source https://github.com/flyteorg/flytesnacks

WORKDIR /root
ENV VENV /opt/venv
ENV LANG C.UTF-8
ENV LC_ALL C.UTF-8
ENV PYTHONPATH /root
ENV DEBIAN_FRONTEND=noninteractive

# Install Python3 and other basics
RUN apt-get update \
    && apt-get install -y software-properties-common \
    && add-apt-repository ppa:ubuntu-toolchain-r/test \
    && add-apt-repository -y ppa:deadsnakes/ppa \
    && apt-get install -y \
    build-essential \
    cmake \
    g++-7 \
    curl \
    git \
    wget \
    python3.10 \
    python3.10-venv \
    python3.10-dev \
    make \
    libssl-dev \
    python3-pip \
    python3-wheel \
    libuv1

ENV VENV /opt/venv
# Virtual environment
RUN python3 -m venv ${VENV}
ENV PATH="${VENV}/bin:$PATH"

# Install wheel after venv is activated
RUN pip3 install wheel

# Install Open MPI
RUN wget --progress=dot:mega -O /tmp/openmpi-4.1.4-bin.tar.gz https://download.open-mpi.org/release/open-mpi/v4.1/openmpi-4.1.4.tar.gz && \
    cd /tmp && tar -zxf /tmp/openmpi-4.1.4-bin.tar.gz && \
    mkdir openmpi-4.1.4/build && cd openmpi-4.1.4/build && ../configure --prefix=/usr/local && \
    make -j all && make install && ldconfig && \
    mpirun --version

# Allow OpenSSH to talk to containers without asking for confirmation
RUN mkdir -p /var/run/sshd
RUN cat /etc/ssh/ssh_config | grep -v StrictHostKeyChecking > /etc/ssh/ssh_config.new && \
    echo "    StrictHostKeyChecking no" >> /etc/ssh/ssh_config.new && \
    mv /etc/ssh/ssh_config.new /etc/ssh/ssh_config

# Install Python dependencies
COPY requirements.in /root
RUN pip install -r /root/requirements.in

# Install TensorFlow
RUN wget https://tf.novaal.de/westmere/tensorflow-2.8.0-cp310-cp310-linux_x86_64.whl && pip install tensorflow-2.8.0-cp310-cp310-linux_x86_64.whl

# Enable GPU
# ENV HOROVOD_GPU_OPERATIONS NCCL
RUN HOROVOD_WITH_MPI=1 pip install --no-cache-dir horovod==0.28.1

# Copy the actual code
COPY . /root/

# This tag is supplied by the build script and will be used to determine the version
# when registering tasks, workflows, and launch plans
ARG tag
ENV FLYTE_INTERNAL_IMAGE $tag
I'm also importing `horovod` in the flyte task: https://gist.github.com/samhita-alla/8a83eaf8a6cc61d85301abc58242f939. Importing it in the flyte task means I needn't install `horovod` on my system.
g
I just finished re-installing Horovod using pip, and I'm still getting the same error. I did see some documentation mentioning creating a docker image, but I never quite understood how to use it (I mean, I didn't know how to make Flyte run my jobs on that docker image), so I moved on without creating one.
Okay, I think I found out how to make Flyte run my jobs using a specific docker image; I'm trying it now. I'll post any news here 😄
So... I managed to make the Flyte jobs execute using the same docker image that you sent me, but I still had no luck. The hello world example and the logistic regression with the wine dataset ran with no errors, but as soon as I try to execute the MPI job, it crashes with the same error in the flytepropeller logs:
RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [mpi]: found no current condition. Conditions: []
The job is the same one that appears in the docs; the only thing I changed was importing Horovod directly in the Flyte task (I really appreciate you mentioning that; without that change it was throwing an error, and I wouldn't have guessed it).
y
@Gastón Ceccotti can you describe the mpijob created?
I want to see if there is any mpijob created at all
kubectl get mpijob
g
I only had to tweak the following in the Dockerfile: adding `python3-venv` to the `apt-get install` list, and installing TensorFlow directly from pip (instead of using the whl directly).
Great, thx for the command. I'm going to execute it now and post the results here
Where should I run it? If I run it in my local environment it shows `No resources found in default namespace.`, and if I run it in the flytepropeller pod terminal it shows `/bin/sh: kubectl: not found`.
y
where do your tasks run?
try
kubectl get mpijob -n flytesnacks-development
g
Mmm, the workflows that ran OK were executed on a local k8s cluster that has Flyte installed on it
Ok, trying it now
It showed the following with that command:
Copy code
adz9c7r4w9lvj4phntxs-n0-0   17m     Created
f691283563e5044edb39-n0-3   22h     Failed
febd3eefb998145288f7-n0-3   4h40m   Failed
btw, the last MPI job that I tried to run was approximately 20 minutes ago, and it's still in `RUNNING` state
y
can you do a describe on that mpijob?
so the API calls are correct, but I suspect that you are using the mpi v2 kubeflow operator
g
Yes, sure
y
Copy code
kubectl get -o yaml mpijobs adz9c7r4w9lvj4phntxs-n0-0 -n flytesnacks-development
g
Trying it now
Ohhh that's great, I'm not much of an expert on k8s, thx for the command!
Okay, so this is the output of that command:
Copy code
kubectl get -o yaml mpijobs adz9c7r4w9lvj4phntxs-n0-0 -n flytesnacks-development
apiVersion: kubeflow.org/v1
kind: MPIJob
metadata:
  annotations:
    cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
  creationTimestamp: "2023-08-09T18:13:55Z"
  generation: 1
  labels:
    domain: development
    execution-id: adz9c7r4w9lvj4phntxs
    interruptible: "false"
    node-id: n0
    project: flytesnacks
    shard-key: "10"
    task-name: workflows-distributed-training-horovod-train-task
    workflow-name: workflows-distributed-training-horovod-training-wf
  name: adz9c7r4w9lvj4phntxs-n0-0
  namespace: flytesnacks-development
  ownerReferences:
  - apiVersion: flyte.lyft.com/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: flyteworkflow
    name: adz9c7r4w9lvj4phntxs
    uid: 7e8c648e-2ba5-4364-aca3-8643e7764bd3
  resourceVersion: "58239"
  uid: 51d3e0c2-1f46-4348-a5f2-8f58bbfa5c3e
spec:
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      restartPolicy: Never
      template:
        metadata: {}
        spec:
          affinity: {}
          containers:
          - args:
            - mpirun
            - --allow-run-as-root
            - -bind-to
            - none
            - -map-by
            - slot
            - -x
            - LD_LIBRARY_PATH
            - -x
            - PATH
            - -x
            - NCCL_DEBUG=INFO
            - -mca
            - pml
            - ob1
            - -mca
            - btl
            - ^openib
            - -np
            - "3"
            - python
            - /opt/venv/bin/entrypoint.py
            - pyflyte-execute
            - --inputs
            - s3://my-s3-bucket/metadata/propeller/flytesnacks-development-adz9c7r4w9lvj4phntxs/n0/data/inputs.pb
            - --output-prefix
            - s3://my-s3-bucket/metadata/propeller/flytesnacks-development-adz9c7r4w9lvj4phntxs/n0/data/0
            - --raw-output-data-prefix
            - s3://my-s3-bucket/q4/adz9c7r4w9lvj4phntxs-n0-0
            - --checkpoint-path
            - s3://my-s3-bucket/q4/adz9c7r4w9lvj4phntxs-n0-0/_flytecheckpoints
            - --prev-checkpoint
            - '""'
            - --resolver
            - flytekit.core.python_auto_container.default_task_resolver
            - --
            - task-module
            - workflows.distributed_training
            - task-name
            - horovod_train_task
            env:
            - name: FLYTE_INTERNAL_EXECUTION_WORKFLOW
              value: flytesnacks:development:workflows.distributed_training.horovod_training_wf
            - name: FLYTE_INTERNAL_EXECUTION_ID
              value: adz9c7r4w9lvj4phntxs
            - name: FLYTE_INTERNAL_EXECUTION_PROJECT
              value: flytesnacks
            - name: FLYTE_INTERNAL_EXECUTION_DOMAIN
              value: development
            - name: FLYTE_ATTEMPT_NUMBER
              value: "0"
            - name: FLYTE_INTERNAL_TASK_PROJECT
              value: flytesnacks
            - name: FLYTE_INTERNAL_TASK_DOMAIN
              value: development
            - name: FLYTE_INTERNAL_TASK_NAME
              value: workflows.distributed_training.horovod_train_task
            - name: FLYTE_INTERNAL_TASK_VERSION
              value: HEAD
            - name: FLYTE_INTERNAL_PROJECT
              value: flytesnacks
            - name: FLYTE_INTERNAL_DOMAIN
              value: development
            - name: FLYTE_INTERNAL_NAME
              value: workflows.distributed_training.horovod_train_task
            - name: FLYTE_INTERNAL_VERSION
              value: HEAD
            - name: FLYTE_AWS_ENDPOINT
              value: http://minio.flyte.svc.cluster.local:9000
            - name: FLYTE_AWS_ACCESS_KEY_ID
              value: minio
            - name: FLYTE_AWS_SECRET_ACCESS_KEY
              value: miniostorage
            image: piloto_mpi:piloto
            name: mpi
            resources:
              limits:
                cpu: 500m
                memory: 1Gi
              requests:
                cpu: 500m
                memory: 1Gi
            terminationMessagePolicy: FallbackToLogsOnError
          restartPolicy: Never
    Worker:
      replicas: 3
      restartPolicy: Never
      template:
        metadata: {}
        spec:
          affinity: {}
          containers:
          - args:
            - mpirun
            - --allow-run-as-root
            - -bind-to
            - none
            - -map-by
            - slot
            - -x
            - LD_LIBRARY_PATH
            - -x
            - PATH
            - -x
            - NCCL_DEBUG=INFO
            - -mca
            - pml
            - ob1
            - -mca
            - btl
            - ^openib
            - -np
            - "3"
            - python
            - /opt/venv/bin/entrypoint.py
            - pyflyte-execute
            - --inputs
            - s3://my-s3-bucket/metadata/propeller/flytesnacks-development-adz9c7r4w9lvj4phntxs/n0/data/inputs.pb
            - --output-prefix
            - s3://my-s3-bucket/metadata/propeller/flytesnacks-development-adz9c7r4w9lvj4phntxs/n0/data/0
            - --raw-output-data-prefix
            - s3://my-s3-bucket/q4/adz9c7r4w9lvj4phntxs-n0-0
            - --checkpoint-path
            - s3://my-s3-bucket/q4/adz9c7r4w9lvj4phntxs-n0-0/_flytecheckpoints
            - --prev-checkpoint
            - '""'
            - --resolver
            - flytekit.core.python_auto_container.default_task_resolver
            - --
            - task-module
            - workflows.distributed_training
            - task-name
            - horovod_train_task
            env:
            - name: FLYTE_INTERNAL_EXECUTION_WORKFLOW
              value: flytesnacks:development:workflows.distributed_training.horovod_training_wf
            - name: FLYTE_INTERNAL_EXECUTION_ID
              value: adz9c7r4w9lvj4phntxs
            - name: FLYTE_INTERNAL_EXECUTION_PROJECT
              value: flytesnacks
            - name: FLYTE_INTERNAL_EXECUTION_DOMAIN
              value: development
            - name: FLYTE_ATTEMPT_NUMBER
              value: "0"
            - name: FLYTE_INTERNAL_TASK_PROJECT
              value: flytesnacks
            - name: FLYTE_INTERNAL_TASK_DOMAIN
              value: development
            - name: FLYTE_INTERNAL_TASK_NAME
              value: workflows.distributed_training.horovod_train_task
            - name: FLYTE_INTERNAL_TASK_VERSION
              value: HEAD
            - name: FLYTE_INTERNAL_PROJECT
              value: flytesnacks
            - name: FLYTE_INTERNAL_DOMAIN
              value: development
            - name: FLYTE_INTERNAL_NAME
              value: workflows.distributed_training.horovod_train_task
            - name: FLYTE_INTERNAL_VERSION
              value: HEAD
            - name: FLYTE_AWS_ENDPOINT
              value: http://minio.flyte.svc.cluster.local:9000
            - name: FLYTE_AWS_ACCESS_KEY_ID
              value: minio
            - name: FLYTE_AWS_SECRET_ACCESS_KEY
              value: miniostorage
            image: piloto_mpi:piloto
            name: mpi
            resources:
              limits:
                cpu: 500m
                memory: 1Gi
              requests:
                cpu: 500m
                memory: 1Gi
            terminationMessagePolicy: FallbackToLogsOnError
          restartPolicy: Never
  runPolicy: {}
  slotsPerWorker: 1
status:
  conditions:
  - lastTransitionTime: "2023-08-09T18:13:55Z"
    lastUpdateTime: "2023-08-09T18:13:55Z"
    message: MPIJob flytesnacks-development/adz9c7r4w9lvj4phntxs-n0-0 is created.
    reason: MPIJobCreated
    status: "True"
    type: Created
  replicaStatuses:
    Launcher: {}
    Worker:
      active: 3
  startTime: "2023-08-09T18:13:55Z"
y
can you do
k get pods -n flytesnacks-development
I think it is due to launcher failure
g
Ohh okey, trying it now
This is the output:
Copy code
kubectl get pods -n flytesnacks-development
NAME                                 READY   STATUS      RESTARTS   AGE
adrfzmnrchw57npfkcj8-n0-0            0/1     Completed   0          36m
adrfzmnrchw57npfkcj8-n1-0            0/1     Completed   0          36m
adrfzmnrchw57npfkcj8-n2-0            0/1     Completed   0          36m
adz9c7r4w9lvj4phntxs-n0-0-launcher   0/1     Pending     0          34m
adz9c7r4w9lvj4phntxs-n0-0-worker-0   1/1     Running     0          34m
adz9c7r4w9lvj4phntxs-n0-0-worker-1   1/1     Running     0          34m
adz9c7r4w9lvj4phntxs-n0-0-worker-2   1/1     Running     0          34m
ak9m6bcvjcvx9vbr8t9z-n0-0            0/1     Completed   0          38m
ak9m6bcvjcvx9vbr8t9z-n1-0            0/1     Completed   0          38m
f691283563e5044edb39-n0-3-launcher   0/1     Error       0          22h
f691283563e5044edb39-n0-3-worker-0   0/1     Error       0          22h
f691283563e5044edb39-n0-3-worker-1   0/1     Error       0          22h
f691283563e5044edb39-n0-3-worker-2   0/1     Error       0          22h
f6f8b932f56c54fbdbe2-n0-0            0/1     Completed   0          22h
f6f8b932f56c54fbdbe2-n1-0            0/1     Completed   0          22h
f6f8b932f56c54fbdbe2-n2-0            0/1     Completed   0          22h
f80e038fa12f44843812-n0-0            0/1     Completed   0          5h1m
f80e038fa12f44843812-n1-0            0/1     Completed   0          5h1m
f80e038fa12f44843812-n2-0            0/1     Completed   0          5h
febd3eefb998145288f7-n0-3-launcher   0/1     Error       0          4h58m
febd3eefb998145288f7-n0-3-worker-0   1/1     Running     0          4h58m
febd3eefb998145288f7-n0-3-worker-1   1/1     Running     0          4h58m
febd3eefb998145288f7-n0-3-worker-2   1/1     Running     0          4h58m
I'm guessing you were correct about the launcher failure?
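The pod list above points the same way: the launcher pods are `Pending` or `Error` while the workers run. The MPIJob's `status.replicaStatuses` block from earlier shows the matching pattern (an empty `Launcher: {}` next to `Worker: active: 3`). A tiny illustrative check of that pattern in plain Python (the function name is made up; this is not operator or propeller code):

```python
# Illustrative only: detect the "workers running, launcher never started"
# pattern from an MPIJob's status.replicaStatuses. Not real operator code.

def launcher_stuck(replica_statuses):
    launcher = replica_statuses.get("Launcher") or {}
    workers = replica_statuses.get("Worker") or {}
    # No active/succeeded/failed counts on the launcher, but active workers:
    # the launcher pod is likely Pending or crashed before it could start.
    return not any(launcher.values()) and workers.get("active", 0) > 0

# Mirrors the status block from the mpijob described earlier:
status = {"Launcher": {}, "Worker": {"active": 3}}
print(launcher_stuck(status))  # True
```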
y
try
kubectl logs f691283563e5044edb39-n0-3-launcher -n flytesnacks-development
g
Okok, trying it now
This is the output shown:
Copy code
kubectl logs f691283563e5044edb39-n0-3-launcher -n flytesnacks-development
Defaulted container "mpi" out of: mpi, kubectl-delivery (init)
unable to retrieve container logs for docker://078ad2551be44d24c5c0050a75bd2b03a14c42fa07a7cc1a4adc401c6a0d3850
y
kubectl describe pod f691283563e5044edb39-n0-3-launcher -n flytesnacks-development
g
This should be the output:
Copy code
kubectl describe pod f691283563e5044edb39-n0-3-launcher -n flytesnacks-development
Name:             f691283563e5044edb39-n0-3-launcher
Namespace:        flytesnacks-development
Priority:         0
Service Account:  f691283563e5044edb39-n0-3-launcher
Node:             docker-desktop/192.168.65.4
Start Time:       Tue, 08 Aug 2023 17:11:09 -0300
Labels:           training.kubeflow.org/job-name=f691283563e5044edb39-n0-3
                  training.kubeflow.org/job-role=master
                  training.kubeflow.org/operator-name=mpijob-controller
                  training.kubeflow.org/replica-type=launcher
Annotations:      <none>
Status:           Failed
IP:               10.1.0.83
IPs:
  IP:           10.1.0.83
Controlled By:  MPIJob/f691283563e5044edb39-n0-3
Init Containers:
  kubectl-delivery:
    Container ID:   docker://dee3fc522e997f022eb4d76a024fabad8c33f04716472350eb2a5eb4039e8a60
    Image:          mpioperator/kubectl-delivery:latest
    Image ID:       docker-pullable://mpioperator/kubectl-delivery@sha256:8a4a24114e0bdc8df8f44e657baa6f5d47b24b1664b26c6f59e06575f8f21a55
    Port:           <none>
    Host Port:      <none>
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Tue, 08 Aug 2023 17:11:10 -0300
      Finished:     Tue, 08 Aug 2023 17:11:17 -0300
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:                100m
      ephemeral-storage:  5Gi
      memory:             512Mi
    Requests:
      cpu:                100m
      ephemeral-storage:  5Gi
      memory:             512Mi
    Environment:
      TARGET_DIR:  /opt/kube
      NAMESPACE:   flytesnacks-development
    Mounts:
      /etc/mpi from mpi-job-config (rw)
      /opt/kube from mpi-job-kubectl (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-z4whm (ro)
Containers:
  mpi:
    Container ID:  docker://078ad2551be44d24c5c0050a75bd2b03a14c42fa07a7cc1a4adc401c6a0d3850
    Image:         cr.flyte.org/flyteorg/flytekit:py3.11-1.8.1
    Image ID:      docker-pullable://cr.flyte.org/flyteorg/flytekit@sha256:07e13d5a3f49b918dcc323a1cb6f01c455b0c71fb46d784b3b958ba919afcc62
    Port:          <none>
    Host Port:     <none>
    Args:
      pyflyte-fast-execute
      --additional-distribution
      s3://my-s3-bucket/flytesnacks/development/A6JBX2NT37TMAF76N7B4ICKM6I======/script_mode.tar.gz
      --dest-dir
      /root
      --
      mpirun
      --allow-run-as-root
      -bind-to
      none
      -map-by
      slot
      -x
      LD_LIBRARY_PATH
      -x
      PATH
      -x
      NCCL_DEBUG=INFO
      -mca
      pml
      ob1
      -mca
      btl
      ^openib
      -np
      3
      python
      /opt/venv/bin/entrypoint.py
      pyflyte-execute
      --inputs
      s3://my-s3-bucket/metadata/propeller/flytesnacks-development-f691283563e5044edb39/n0/data/inputs.pb
      --output-prefix
      s3://my-s3-bucket/metadata/propeller/flytesnacks-development-f691283563e5044edb39/n0/data/3
      --raw-output-data-prefix
      s3://my-s3-bucket/54/f691283563e5044edb39-n0-3
      --checkpoint-path
      s3://my-s3-bucket/54/f691283563e5044edb39-n0-3/_flytecheckpoints
      --prev-checkpoint
      s3://my-s3-bucket/u3/f691283563e5044edb39-n0-2/_flytecheckpoints
      --resolver
      flytekit.core.python_auto_container.default_task_resolver
      --
      task-module
      distributed_training
      task-name
      horovod_train_task
    State:      Terminated
      Reason:   Error
      Message:  /usr/local/lib/python3.11/site-packages/click/core.py:783 in invoke
                    return __callback(*args, **kwargs)
                /usr/local/lib/python3.11/site-packages/flytekit/bin/entrypoint.py:517 in fast_execute_task_cmd
                    p = subprocess.run(cmd, check=False)
                /usr/local/lib/python3.11/subprocess.py:548 in run
                    with Popen(*popenargs, **kwargs) as process:
                /usr/local/lib/python3.11/subprocess.py:1026 in __init__
                    self._execute_child(args, executable, preexec_fn, close_f...
                /usr/local/lib/python3.11/subprocess.py:1950 in _execute_child
                    raise child_exception_type(errno_num, err_msg, er...
FileNotFoundError: [Errno 2] No such file or directory: 'mpirun'

      Exit Code:    1
      Started:      Tue, 08 Aug 2023 17:11:17 -0300
      Finished:     Tue, 08 Aug 2023 17:11:19 -0300
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     500m
      memory:  1Gi
    Requests:
      cpu:     500m
      memory:  1Gi
    Environment:
      FLYTE_INTERNAL_EXECUTION_WORKFLOW:  flytesnacks:development:distributed_training.horovod_training_wf
      FLYTE_INTERNAL_EXECUTION_ID:        f691283563e5044edb39
      FLYTE_INTERNAL_EXECUTION_PROJECT:   flytesnacks
      FLYTE_INTERNAL_EXECUTION_DOMAIN:    development
      FLYTE_ATTEMPT_NUMBER:               3
      FLYTE_INTERNAL_TASK_PROJECT:        flytesnacks
      FLYTE_INTERNAL_TASK_DOMAIN:         development
      FLYTE_INTERNAL_TASK_NAME:           distributed_training.horovod_train_task
      FLYTE_INTERNAL_TASK_VERSION:        c1os32vnshq5Bl9727lWxA==
      FLYTE_INTERNAL_PROJECT:             flytesnacks
      FLYTE_INTERNAL_DOMAIN:              development
      FLYTE_INTERNAL_NAME:                distributed_training.horovod_train_task
      FLYTE_INTERNAL_VERSION:             c1os32vnshq5Bl9727lWxA==
      FLYTE_AWS_ACCESS_KEY_ID:            minio
      FLYTE_AWS_SECRET_ACCESS_KEY:        miniostorage
      FLYTE_AWS_ENDPOINT:                 http://minio.flyte.svc.cluster.local:9000
      OMPI_MCA_plm_rsh_agent:             /etc/mpi/kubexec.sh
      OMPI_MCA_orte_default_hostfile:     /etc/mpi/hostfile
      NVIDIA_VISIBLE_DEVICES:             
      NVIDIA_DRIVER_CAPABILITIES:         
    Mounts:
      /etc/mpi from mpi-job-config (rw)
      /opt/kube from mpi-job-kubectl (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-z4whm (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  mpi-job-kubectl:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  mpi-job-config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      f691283563e5044edb39-n0-3-config
    Optional:  false
  kube-api-access-z4whm:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Guaranteed
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:                      <none>
y
Copy code
cr.flyte.org/flyteorg/flytekit:py3.11-1.8.1
probably does not have mpi installed correctly
g
Mmm, not sure why it says `FileNotFoundError: [Errno 2] No such file or directory: 'mpirun'` even though the image has an MPI installation. I mean, it works with the one that I created.
Hmm, how could we check that?
Or how could I install it correctly? I thought that using the custom docker image was going to do the trick.
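For what it's worth, that `FileNotFoundError` is exactly what Python's `subprocess` raises when the executable isn't on `PATH` inside the container: the launcher pod ran `mpirun ...` from an image without Open MPI. A quick stdlib sketch of both the failure mode and a preflight check (`mpirun` is the real binary name from the traceback; the rest is illustrative):

```python
import shutil
import subprocess

# The launcher container ultimately does subprocess.run(["mpirun", ...]);
# if the image has no mpirun on PATH, Popen raises FileNotFoundError
# before the job does any work, matching the traceback above.

def image_can_launch(binary: str = "mpirun") -> bool:
    """Cheap preflight check you could run inside the container."""
    return shutil.which(binary) is not None

# Reproducing the failure mode with a deliberately nonexistent binary:
try:
    subprocess.run(["no-such-binary-here-xyz"], check=False)
except FileNotFoundError as exc:
    print(f"launcher would die with: {exc}")
```

Inside the running image, `which mpirun` (or the check above) quickly tells you whether the image the launcher actually got is the one you built.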
y
oh wait
I lost track
did you build your own image?
g
haha no trouble at all
Yeap
y
so your task should be using the new image
@task(image=<your_new_image>)
you need to specify it
g
I tried running this workflow using the docker image that gets created with the Dockerfile that is in this thread 😄
OHHHHHHHHHH
Okok, let me try that out and I'll get back to you
If that does the trick, I promise you a bottle of wine and some asado if you ever come to Argentina hahaha
y
but regardless, I think we have some issues with the failure-handling mechanism; I'd assume that when the launcher fails, the whole job should fail. Let me do some investigation later
yeah just keep me posted with the updates
g
Yeap, no problem
Just to check... could it be that instead of `@task(image="<image_name>")`, it is now `@task(container_image="<image_name>")`?
y
container_image is correct
sorry, I was just trying to recall it off the top of my head, so it can be inaccurate
g
No problem at all! Ok, let me just build the package and register it and I'll be back with the results, fingers crossed!
y
one thing to add: I don't know your setup, but you probably want to push your image somewhere.
are you running on minikube or something?
g
Yeap, I'm using Docker Desktop and the Kubernetes that comes with it
I'm running everything on my Mac; I wanted to get everything working locally before taking it to the definitive k8s cluster (somewhat like a POC)
y
I am not sure about "Docker Desktop and the Kubernetes that comes with it"
how did you setup your k8s cluster?
g
I managed to register it. If it says that the workflows already exist... does that mean they got uploaded anyway? Or were they just ignored?
y
try registering with a newer version
g
Ok, I'll try it now
Coming back to your question: I installed docker (and Docker Desktop) on my notebook, and Docker Desktop has some settings you can tweak; among them there's an option that says "Enable Kubernetes"
A few minutes go by and you have a k8s cluster running locally (I'm working with the default cluster)
y
oh wow, that is something I just learnt. thanks for it
g
Yeah, sure! No trouble at all! I had some really good luck when I found that out; it really saved me haha
So, right now it shows as running, but I'm not sure how to give you more info
y
you can check if `describe pod` gives you the correct image you specified
and you can check the launcher pod's logs
g
Oh that's nice, ok I'll try that
It shows the following:
Copy code
kubectl get -o yaml mpijobs a7nfz7zcpnczvmnhqt6s-n0-0 -n flytesnacks-development
apiVersion: kubeflow.org/v1
kind: MPIJob
metadata:
  annotations:
    cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
  creationTimestamp: "2023-08-09T19:18:13Z"
  generation: 1
  labels:
    domain: development
    execution-id: a7nfz7zcpnczvmnhqt6s
    interruptible: "false"
    node-id: n0
    project: flytesnacks
    shard-key: "3"
    task-name: workflows-distributed-training-horovod-train-task
    workflow-name: workflows-distributed-training-horovod-training-wf
  name: a7nfz7zcpnczvmnhqt6s-n0-0
  namespace: flytesnacks-development
  ownerReferences:
  - apiVersion: flyte.lyft.com/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: flyteworkflow
    name: a7nfz7zcpnczvmnhqt6s
    uid: cef3f28b-bdf3-471a-bac3-0ace667f44d6
  resourceVersion: "70681"
  uid: 7d3e4d21-bd7d-424e-b0d4-af808d238d0c
spec:
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      restartPolicy: Never
      template:
        metadata: {}
        spec:
          affinity: {}
          containers:
          - args:
            - mpirun
            - --allow-run-as-root
            - -bind-to
            - none
            - -map-by
            - slot
            - -x
            - LD_LIBRARY_PATH
            - -x
            - PATH
            - -x
            - NCCL_DEBUG=INFO
            - -mca
            - pml
            - ob1
            - -mca
            - btl
            - ^openib
            - -np
            - "3"
            - python
            - /opt/venv/bin/entrypoint.py
            - pyflyte-execute
            - --inputs
            - s3://my-s3-bucket/metadata/propeller/flytesnacks-development-a7nfz7zcpnczvmnhqt6s/n0/data/inputs.pb
            - --output-prefix
            - s3://my-s3-bucket/metadata/propeller/flytesnacks-development-a7nfz7zcpnczvmnhqt6s/n0/data/0
            - --raw-output-data-prefix
            - s3://my-s3-bucket/ar/a7nfz7zcpnczvmnhqt6s-n0-0
            - --checkpoint-path
            - s3://my-s3-bucket/ar/a7nfz7zcpnczvmnhqt6s-n0-0/_flytecheckpoints
            - --prev-checkpoint
            - '""'
            - --resolver
            - flytekit.core.python_auto_container.default_task_resolver
            - --
            - task-module
            - workflows.distributed_training
            - task-name
            - horovod_train_task
            env:
            - name: FLYTE_INTERNAL_EXECUTION_WORKFLOW
              value: flytesnacks:development:workflows.distributed_training.horovod_training_wf
            - name: FLYTE_INTERNAL_EXECUTION_ID
              value: a7nfz7zcpnczvmnhqt6s
            - name: FLYTE_INTERNAL_EXECUTION_PROJECT
              value: flytesnacks
            - name: FLYTE_INTERNAL_EXECUTION_DOMAIN
              value: development
            - name: FLYTE_ATTEMPT_NUMBER
              value: "0"
            - name: FLYTE_INTERNAL_TASK_PROJECT
              value: flytesnacks
            - name: FLYTE_INTERNAL_TASK_DOMAIN
              value: development
            - name: FLYTE_INTERNAL_TASK_NAME
              value: workflows.distributed_training.horovod_train_task
            - name: FLYTE_INTERNAL_TASK_VERSION
              value: "2"
            - name: FLYTE_INTERNAL_PROJECT
              value: flytesnacks
            - name: FLYTE_INTERNAL_DOMAIN
              value: development
            - name: FLYTE_INTERNAL_NAME
              value: workflows.distributed_training.horovod_train_task
            - name: FLYTE_INTERNAL_VERSION
              value: "2"
            - name: FLYTE_AWS_SECRET_ACCESS_KEY
              value: miniostorage
            - name: FLYTE_AWS_ENDPOINT
              value: http://minio.flyte.svc.cluster.local:9000
            - name: FLYTE_AWS_ACCESS_KEY_ID
              value: minio
            image: piloto_mpi:piloto
            name: mpi
            resources:
              limits:
                cpu: 500m
                memory: 1Gi
              requests:
                cpu: 500m
                memory: 1Gi
            terminationMessagePolicy: FallbackToLogsOnError
          restartPolicy: Never
    Worker:
      replicas: 3
      restartPolicy: Never
      template:
        metadata: {}
        spec:
          affinity: {}
          containers:
          - args:
            - mpirun
            - --allow-run-as-root
            - -bind-to
            - none
            - -map-by
            - slot
            - -x
            - LD_LIBRARY_PATH
            - -x
            - PATH
            - -x
            - NCCL_DEBUG=INFO
            - -mca
            - pml
            - ob1
            - -mca
            - btl
            - ^openib
            - -np
            - "3"
            - python
            - /opt/venv/bin/entrypoint.py
            - pyflyte-execute
            - --inputs
            - s3://my-s3-bucket/metadata/propeller/flytesnacks-development-a7nfz7zcpnczvmnhqt6s/n0/data/inputs.pb
            - --output-prefix
            - s3://my-s3-bucket/metadata/propeller/flytesnacks-development-a7nfz7zcpnczvmnhqt6s/n0/data/0
            - --raw-output-data-prefix
            - s3://my-s3-bucket/ar/a7nfz7zcpnczvmnhqt6s-n0-0
            - --checkpoint-path
            - s3://my-s3-bucket/ar/a7nfz7zcpnczvmnhqt6s-n0-0/_flytecheckpoints
            - --prev-checkpoint
            - '""'
            - --resolver
            - flytekit.core.python_auto_container.default_task_resolver
            - --
            - task-module
            - workflows.distributed_training
            - task-name
            - horovod_train_task
            env:
            - name: FLYTE_INTERNAL_EXECUTION_WORKFLOW
              value: flytesnacks:development:workflows.distributed_training.horovod_training_wf
            - name: FLYTE_INTERNAL_EXECUTION_ID
              value: a7nfz7zcpnczvmnhqt6s
            - name: FLYTE_INTERNAL_EXECUTION_PROJECT
              value: flytesnacks
            - name: FLYTE_INTERNAL_EXECUTION_DOMAIN
              value: development
            - name: FLYTE_ATTEMPT_NUMBER
              value: "0"
            - name: FLYTE_INTERNAL_TASK_PROJECT
              value: flytesnacks
            - name: FLYTE_INTERNAL_TASK_DOMAIN
              value: development
            - name: FLYTE_INTERNAL_TASK_NAME
              value: workflows.distributed_training.horovod_train_task
            - name: FLYTE_INTERNAL_TASK_VERSION
              value: "2"
            - name: FLYTE_INTERNAL_PROJECT
              value: flytesnacks
            - name: FLYTE_INTERNAL_DOMAIN
              value: development
            - name: FLYTE_INTERNAL_NAME
              value: workflows.distributed_training.horovod_train_task
            - name: FLYTE_INTERNAL_VERSION
              value: "2"
            - name: FLYTE_AWS_SECRET_ACCESS_KEY
              value: miniostorage
            - name: FLYTE_AWS_ENDPOINT
              value: http://minio.flyte.svc.cluster.local:9000
            - name: FLYTE_AWS_ACCESS_KEY_ID
              value: minio
            image: piloto_mpi:piloto
            name: mpi
            resources:
              limits:
                cpu: 500m
                memory: 1Gi
              requests:
                cpu: 500m
                memory: 1Gi
            terminationMessagePolicy: FallbackToLogsOnError
          restartPolicy: Never
  runPolicy: {}
  slotsPerWorker: 1
status:
  conditions:
  - lastTransitionTime: "2023-08-09T19:18:13Z"
    lastUpdateTime: "2023-08-09T19:18:13Z"
    message: MPIJob flytesnacks-development/a7nfz7zcpnczvmnhqt6s-n0-0 is created.
    reason: MPIJobCreated
    status: "True"
    type: Created
  replicaStatuses:
    Launcher: {}
    Worker:
      active: 3
  startTime: "2023-08-09T19:18:13Z"
But now I was checking the last output of this command and it also stated that the image was
piloto_mpi:piloto
y
piloto_mpi:piloto
should be the image you built, right?
g
When I create the package, I execute this command:
pyflyte --pkgs workflows package --image piloto_mpi:piloto
y
that is correct
g
Yeap, that would be the image
y
what's the issue now?
g
I'm not sure if it's executing properly. I mean, in the propeller logs I see the following
E0809 19:18:13.614887       1 workers.go:102] error syncing 'flytesnacks-development/a7nfz7zcpnczvmnhqt6s': failed at Node[n0]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [mpi]: found no current condition. Conditions: []
and when I look at the execution time of the job on the Flyte console.. I see that same exact time on the execution page of this workflow
The pods don't seem to be in error, but the launcher is still Pending:
kubectl get pods -n flytesnacks-development                                     
NAME                                 READY   STATUS      RESTARTS   AGE
a7nfz7zcpnczvmnhqt6s-n0-0-launcher   0/1     Pending     0          15m
a7nfz7zcpnczvmnhqt6s-n0-0-worker-0   1/1     Running     0          15m
a7nfz7zcpnczvmnhqt6s-n0-0-worker-1   1/1     Running     0          15m
a7nfz7zcpnczvmnhqt6s-n0-0-worker-2   1/1     Running     0          15m
y
Can you describe that pending pod?
I think it's an out-of-resources issue
Try deleting all the running worker pods that were supposed to be killed
g
Yeah, sure! Brb with the results of the describe
OMG, you are greattttt, yeah! You were right about the resources issue:
Events:
  Type     Reason            Age               From               Message
  ----     ------            ----              ----               -------
  Warning  FailedScheduling  8s (x5 over 20m)  default-scheduler  0/1 nodes are available: 1 Insufficient memory. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod.
How did you know? Hahaha
And.. a more important question, how do we fix it? 😂
y
I've seen all these errors before 😂
delete all the pods that are running
g
The launcher and the worker pods? Or only the launcher? Or.. do you mean, like.. every pod
y
I mean the running worker pods along with the failed launcher pod
g
Great… hmm, just out of curiosity, do you know how to do that? I'll google it otherwise, no biggie
Haha
y
let me give you a command real quick
best to do now is probably delete all pods in that namespace and relaunch your task
g
Thxx! I'm more used to using k8s from Rancher, where I can simply kill a pod by entering the deployment screen and reducing the instances of that deployment, but no idea how to do that with commands
Okey, great
if I terminate the job.. that should kill them, right?
y
kubectl delete --all pods -n flytesnacks-development
try this
g
And also..
requests=Resources(cpu="1", mem="2000Mi"),
I had this commented on the task annotation, if I uncomment it.. would that be enough? Or should I give it some more?
btw, my notebook has 8 cpu and 16GB of RAM
Ok, thx! Trying that command now
y
if you comment it out, it will use flyte default resource, which is like 500Mi
I doubt if that can even launch your image
g
Haha okok, would you say that those specs should be enough? I mean
requests=Resources(cpu="1", mem="2000Mi")
btw, the pods are all dead now. If we are okey with those specs I'll make a new package and relaunch the job 🙌
y
that should be enough
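(For reference, the scheduling arithmetic behind that "Insufficient memory" event can be sketched in plain Python. The request quantities below come from the MPIJob spec earlier in the thread, 1 launcher + 3 workers at 500m CPU / 1Gi each; the parsing helpers are hypothetical, not a Kubernetes API.)

```python
# Sketch of the scheduling arithmetic behind the "Insufficient memory" event.
# Kubernetes quantity conventions: "500m" = 0.5 CPU cores; "Mi"/"Gi" are
# binary megabytes/gigabytes (powers of 1024).

def parse_cpu(q: str) -> float:
    """Parse a Kubernetes CPU quantity ("500m", "2") into cores."""
    return float(q[:-1]) / 1000 if q.endswith("m") else float(q)

def parse_mem_mib(q: str) -> float:
    """Parse a Kubernetes memory quantity ("512Mi", "1Gi") into MiB."""
    for suffix, factor in (("Gi", 1024), ("Mi", 1)):
        if q.endswith(suffix):
            return float(q[: -len(suffix)]) * factor
    raise ValueError(f"unsupported quantity: {q}")

# Requests from the MPIJob spec above: 1 launcher + 3 workers, each asking
# for 500m CPU and 1Gi memory.
replicas = [("launcher", "500m", "1Gi")] + [(f"worker-{i}", "500m", "1Gi") for i in range(3)]
total_cpu = sum(parse_cpu(cpu) for _, cpu, _ in replicas)
total_mem_mib = sum(parse_mem_mib(mem) for _, _, mem in replicas)
print(f"job requests: {total_cpu} cores, {total_mem_mib:.0f} MiB")  # job requests: 2.0 cores, 4096 MiB
```

On a single-node Docker Desktop cluster the VM's allocatable memory also has to cover the Flyte control-plane pods, so four 1Gi requests can already push past what the scheduler has left, which matches the FailedScheduling message.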
g
Okey great! Fingers crossed!
Hmm.. it still says that it has insufficient memory
the describe shows the following:
kubectl describe pod ahj2xvlznbr6t9knh79z-n0-0-launcher -n flytesnacks-development
Name:             ahj2xvlznbr6t9knh79z-n0-0-launcher
Namespace:        flytesnacks-development
Priority:         0
Service Account:  ahj2xvlznbr6t9knh79z-n0-0-launcher
Node:             <none>
Labels:           training.kubeflow.org/job-name=ahj2xvlznbr6t9knh79z-n0-0
                  training.kubeflow.org/job-role=master
                  training.kubeflow.org/operator-name=mpijob-controller
                  training.kubeflow.org/replica-type=launcher
Annotations:      <none>
Status:           Pending
IP:               
IPs:              <none>
Controlled By:    MPIJob/ahj2xvlznbr6t9knh79z-n0-0
Init Containers:
  kubectl-delivery:
    Image:      mpioperator/kubectl-delivery:latest
    Port:       <none>
    Host Port:  <none>
    Limits:
      cpu:                100m
      ephemeral-storage:  5Gi
      memory:             512Mi
    Requests:
      cpu:                100m
      ephemeral-storage:  5Gi
      memory:             512Mi
    Environment:
      TARGET_DIR:  /opt/kube
      NAMESPACE:   flytesnacks-development
    Mounts:
      /etc/mpi from mpi-job-config (rw)
      /opt/kube from mpi-job-kubectl (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-2bp5h (ro)
Containers:
  mpi:
    Image:      piloto_mpi:piloto
    Port:       <none>
    Host Port:  <none>
    Args:
      mpirun
      --allow-run-as-root
      -bind-to
      none
      -map-by
      slot
      -x
      LD_LIBRARY_PATH
      -x
      PATH
      -x
      NCCL_DEBUG=INFO
      -mca
      pml
      ob1
      -mca
      btl
      ^openib
      -np
      3
      python
      /opt/venv/bin/entrypoint.py
      pyflyte-execute
      --inputs
      s3://my-s3-bucket/metadata/propeller/flytesnacks-development-ahj2xvlznbr6t9knh79z/n0/data/inputs.pb
      --output-prefix
      s3://my-s3-bucket/metadata/propeller/flytesnacks-development-ahj2xvlznbr6t9knh79z/n0/data/0
      --raw-output-data-prefix
      s3://my-s3-bucket/im/ahj2xvlznbr6t9knh79z-n0-0
      --checkpoint-path
      s3://my-s3-bucket/im/ahj2xvlznbr6t9knh79z-n0-0/_flytecheckpoints
      --prev-checkpoint
      ""
      --resolver
      flytekit.core.python_auto_container.default_task_resolver
      --
      task-module
      workflows.distributed_training
      task-name
      horovod_train_task
    Limits:
      cpu:     1
      memory:  2000Mi
    Requests:
      cpu:     1
      memory:  2000Mi
    Environment:
      FLYTE_INTERNAL_EXECUTION_WORKFLOW:  flytesnacks:development:workflows.distributed_training.horovod_training_wf
      FLYTE_INTERNAL_EXECUTION_ID:        ahj2xvlznbr6t9knh79z
      FLYTE_INTERNAL_EXECUTION_PROJECT:   flytesnacks
      FLYTE_INTERNAL_EXECUTION_DOMAIN:    development
      FLYTE_ATTEMPT_NUMBER:               0
      FLYTE_INTERNAL_TASK_PROJECT:        flytesnacks
      FLYTE_INTERNAL_TASK_DOMAIN:         development
      FLYTE_INTERNAL_TASK_NAME:           workflows.distributed_training.horovod_train_task
      FLYTE_INTERNAL_TASK_VERSION:        3
      FLYTE_INTERNAL_PROJECT:             flytesnacks
      FLYTE_INTERNAL_DOMAIN:              development
      FLYTE_INTERNAL_NAME:                workflows.distributed_training.horovod_train_task
      FLYTE_INTERNAL_VERSION:             3
      FLYTE_AWS_ENDPOINT:                 http://minio.flyte.svc.cluster.local:9000
      FLYTE_AWS_ACCESS_KEY_ID:            minio
      FLYTE_AWS_SECRET_ACCESS_KEY:        miniostorage
      OMPI_MCA_plm_rsh_agent:             /etc/mpi/kubexec.sh
      OMPI_MCA_orte_default_hostfile:     /etc/mpi/hostfile
      NVIDIA_VISIBLE_DEVICES:             
      NVIDIA_DRIVER_CAPABILITIES:         
    Mounts:
      /etc/mpi from mpi-job-config (rw)
      /opt/kube from mpi-job-kubectl (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-2bp5h (ro)
Conditions:
  Type           Status
  PodScheduled   False 
Volumes:
  mpi-job-kubectl:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  mpi-job-config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      ahj2xvlznbr6t9knh79z-n0-0-config
    Optional:  false
  kube-api-access-2bp5h:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Guaranteed
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age    From               Message
  ----     ------            ----   ----               -------
  Warning  FailedScheduling  3m28s  default-scheduler  0/1 nodes are available: 1 Insufficient memory. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod.
it seems that it is now working with the
2000Mi
y
try reduce your worker to 2 or 1
g
Does it really need that much?
y
let's just make sure it works first
g
Ohh okey okey
Yeap, trying with only one worker now
y
might want to delete all the pods again before you try
g
Yeap, I remembered to do that
It still seems that there is not enough memory; this is the output of the describe:
Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  95s   default-scheduler  0/1 nodes are available: 1 Insufficient memory. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod.
  Normal   Scheduled         90s   default-scheduler  Successfully assigned flytesnacks-development/aq4nm75p6r5ljxmf56bd-n0-3-launcher to docker-desktop
  Normal   Pulled            89s   kubelet            Container image "mpioperator/kubectl-delivery:latest" already present on machine
  Normal   Created           89s   kubelet            Created container kubectl-delivery
  Normal   Started           89s   kubelet            Started container kubectl-delivery
  Normal   Pulled            84s   kubelet            Container image "piloto_mpi:piloto" already present on machine
  Normal   Created           83s   kubelet            Created container mpi
  Normal   Started           83s   kubelet            Started container mpi
I think we should increase the value of the memory?
y
yeah we can try that, or lower the task memory to 1000Mi or something
g
I know how to do the first one, not sure how to do the second one haha
Sorry, I have to run now, but as soon as I can I'll try the first one, and if the second one is here by then that would be great! I hope it gets fixed with this, and if not I hope I can catch you tomorrow. You've been great, giving excellent support about all this, thx!
y
np, you just set
requests=Resources(cpu="1", mem="1000Mi")
good luck
g
Hi everyone! It's me again haha, sorry to bother you another day. Apparently now the pods are starting correctly (there's no longer the error of
Insufficient memory
) since I added the following properties to the task annotation
requests=Resources(cpu="1", mem="1000Mi"),limits=Resources(cpu="2", mem="3000Mi"),
and this is what shows up now when I describe the pod:
kubectl describe pod apgpsl92cr9brztxkqsc-n0-2-launcher -n flytesnacks-development
Name:             apgpsl92cr9brztxkqsc-n0-2-launcher
Namespace:        flytesnacks-development
Priority:         0
Service Account:  apgpsl92cr9brztxkqsc-n0-2-launcher
Node:             docker-desktop/192.168.65.4
Start Time:       Thu, 10 Aug 2023 08:44:49 -0300
Labels:           training.kubeflow.org/job-name=apgpsl92cr9brztxkqsc-n0-2
                  training.kubeflow.org/job-role=master
                  training.kubeflow.org/operator-name=mpijob-controller
                  training.kubeflow.org/replica-type=launcher
Annotations:      <none>
Status:           Failed
IP:               10.1.0.191
IPs:
  IP:           10.1.0.191
Controlled By:  MPIJob/apgpsl92cr9brztxkqsc-n0-2
Init Containers:
  kubectl-delivery:
    Container ID:   docker://de736cc67b7c7e702257d377bcfb69556d638b7ce975360315ae566f3b41fd5c
    Image:          mpioperator/kubectl-delivery:latest
    Image ID:       docker-pullable://mpioperator/kubectl-delivery@sha256:8a4a24114e0bdc8df8f44e657baa6f5d47b24b1664b26c6f59e06575f8f21a55
    Port:           <none>
    Host Port:      <none>
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Thu, 10 Aug 2023 08:44:49 -0300
      Finished:     Thu, 10 Aug 2023 08:44:55 -0300
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:                100m
      ephemeral-storage:  5Gi
      memory:             512Mi
    Requests:
      cpu:                100m
      ephemeral-storage:  5Gi
      memory:             512Mi
    Environment:
      TARGET_DIR:  /opt/kube
      NAMESPACE:   flytesnacks-development
    Mounts:
      /etc/mpi from mpi-job-config (rw)
      /opt/kube from mpi-job-kubectl (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-g99kl (ro)
Containers:
  mpi:
    Container ID:  docker://78e78d32b2927bdf4a04d3bf714877de6bb0c84bcc84668598c776c84f9448d6
    Image:         piloto_mpi:piloto
    Image ID:      docker://sha256:5b9119f28d46ff4859859c2f588b86a5d18e319705c44cdd3a0081e391851433
    Port:          <none>
    Host Port:     <none>
    Args:
      mpirun
      --allow-run-as-root
      -bind-to
      none
      -map-by
      slot
      -x
      LD_LIBRARY_PATH
      -x
      PATH
      -x
      NCCL_DEBUG=INFO
      -mca
      pml
      ob1
      -mca
      btl
      ^openib
      -np
      1
      python
      /opt/venv/bin/entrypoint.py
      pyflyte-execute
      --inputs
      s3://my-s3-bucket/metadata/propeller/flytesnacks-development-apgpsl92cr9brztxkqsc/n0/data/inputs.pb
      --output-prefix
      s3://my-s3-bucket/metadata/propeller/flytesnacks-development-apgpsl92cr9brztxkqsc/n0/data/2
      --raw-output-data-prefix
      s3://my-s3-bucket/0x/apgpsl92cr9brztxkqsc-n0-2
      --checkpoint-path
      s3://my-s3-bucket/0x/apgpsl92cr9brztxkqsc-n0-2/_flytecheckpoints
      --prev-checkpoint
      s3://my-s3-bucket/pw/apgpsl92cr9brztxkqsc-n0-1/_flytecheckpoints
      --resolver
      flytekit.core.python_auto_container.default_task_resolver
      --
      task-module
      workflows.distributed_training
      task-name
      horovod_train_task
    State:      Terminated
      Reason:   Error
      Message:  … 295                 return func(*args, **kwargs)

                /opt/venv/lib/python3.8/site-packages/flytekit/core/python_auto_container.py:235 in load_task

                  235   task_module = importlib.import_module(name=task_module)

                /usr/lib/python3.8/importlib/__init__.py:127 in import_module

                  127   return _bootstrap._gcd_import(name[level:], package, level)
                in _gcd_import:1014
                in _find_and_load:991
                in _find_and_load_unlocked:973
ModuleNotFoundError: No module named 'workflows.distributed_training'
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[54850,1],0]
  Exit code:    1
--------------------------------------------------------------------------

      Exit Code:    1
      Started:      Thu, 10 Aug 2023 08:44:56 -0300
      Finished:     Thu, 10 Aug 2023 08:45:00 -0300
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     2
      memory:  3000Mi
    Requests:
      cpu:     1
      memory:  1000Mi
    Environment:
      FLYTE_INTERNAL_EXECUTION_WORKFLOW:  flytesnacks:development:workflows.distributed_training.horovod_training_wf
      FLYTE_INTERNAL_EXECUTION_ID:        apgpsl92cr9brztxkqsc
      FLYTE_INTERNAL_EXECUTION_PROJECT:   flytesnacks
      FLYTE_INTERNAL_EXECUTION_DOMAIN:    development
      FLYTE_ATTEMPT_NUMBER:               2
      FLYTE_INTERNAL_TASK_PROJECT:        flytesnacks
      FLYTE_INTERNAL_TASK_DOMAIN:         development
      FLYTE_INTERNAL_TASK_NAME:           workflows.distributed_training.horovod_train_task
      FLYTE_INTERNAL_TASK_VERSION:        6
      FLYTE_INTERNAL_PROJECT:             flytesnacks
      FLYTE_INTERNAL_DOMAIN:              development
      FLYTE_INTERNAL_NAME:                workflows.distributed_training.horovod_train_task
      FLYTE_INTERNAL_VERSION:             6
      FLYTE_AWS_ENDPOINT:                 http://minio.flyte.svc.cluster.local:9000
      FLYTE_AWS_ACCESS_KEY_ID:            minio
      FLYTE_AWS_SECRET_ACCESS_KEY:        miniostorage
      OMPI_MCA_plm_rsh_agent:             /etc/mpi/kubexec.sh
      OMPI_MCA_orte_default_hostfile:     /etc/mpi/hostfile
      NVIDIA_VISIBLE_DEVICES:             
      NVIDIA_DRIVER_CAPABILITIES:         
    Mounts:
      /etc/mpi from mpi-job-config (rw)
      /opt/kube from mpi-job-kubectl (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-g99kl (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  mpi-job-kubectl:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  mpi-job-config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      apgpsl92cr9brztxkqsc-n0-2-config
    Optional:  false
  kube-api-access-g99kl:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  91s   default-scheduler  Successfully assigned flytesnacks-development/apgpsl92cr9brztxkqsc-n0-2-launcher to docker-desktop
  Normal  Pulled     91s   kubelet            Container image "mpioperator/kubectl-delivery:latest" already present on machine
  Normal  Created    91s   kubelet            Created container kubectl-delivery
  Normal  Started    91s   kubelet            Started container kubectl-delivery
  Normal  Pulled     84s   kubelet            Container image "piloto_mpi:piloto" already present on machine
  Normal  Created    84s   kubelet            Created container mpi
  Normal  Started    84s   kubelet            Started container mpi
I believe that this might be the more relevant part
ModuleNotFoundError: No module named 'workflows.distributed_training'
btw, my project structure is as follows:
piloto_mpi
├── helm
├── workflows
│   ├── distributed_training.py
│   ├── example.py
│   ├── logistic_regression_wine.py
├── Dockerfile
├── docker_build.sh
├── flyte-package.tgz
└── requirements.txt
As always, any help is greatly appreciated 😄
And these are the two commands that I execute to register the workflows to Flyte:
pyflyte --pkgs workflows package --image piloto_mpi:piloto
flytectl register files --project flytesnacks --domain development --archive flyte-package.tgz --version 6
Hmmm ok, it seems to be working now; apparently I had to re-build the image! 🎉 Thx everyone for the help!
But there's still a question that remains in my head.. why did I need to re-build the Docker image that I wanted to use? I mean, does that mean that I should re-build the image every time I make a change to the workflow code? Or only when I make changes to the properties passed to the
@task
annotation?
s
Since you're registering (but not fast-registering) your code, you need to build a Docker image every time you change your code. Can you try fast-registering your code? You can use this command:
pyflyte register --image <your-image> workflows
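(The earlier ModuleNotFoundError makes sense in this light: at run time the default task resolver imports the task's module by its dotted path inside the container, so the module has to exist in the image, or be shipped separately by fast registration. A minimal local reproduction of that import step; the module name is the one from the traceback above:)

```python
import importlib

# The default task resolver effectively calls importlib.import_module on
# the dotted module path recorded at registration time. If the container
# image doesn't contain that module, the launcher fails exactly this way.
try:
    importlib.import_module("workflows.distributed_training")
except ModuleNotFoundError as exc:
    print(f"ModuleNotFoundError: {exc}")
```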
g
Ohh okey okey, I'll try with that one and see how it goes! Thx again for all the help! 🙌