# flyte-deployment
g
Hi community, I'm getting an `UNKNOWN` status on every workflow that I submit to Flyte, and it just stays in that state (it never evolves to a `RUNNING` state). Some background on the Flyte installation: I deployed Flyte on a local K8s before deploying it on our real K8s environment (sort of a POC). I recently installed the MPI operator in order to be able to parallelize an ML workflow. I couldn't update the Helm chart because it was throwing the following error:
Error: UPGRADE FAILED: rendered manifests contain a resource that already exists. Unable to continue with update: Secret "kubernetes-dashboard-csrf" in namespace "flyte" exists and cannot be imported into the current release: invalid ownership metadata; annotation validation error: key "meta.helm.sh/release-name" must equal "flyte": current value is "flyte-deps"
I ended up modifying the flyte-core values file, adding the `enabled_plugins` property to the `configmap` section, in accordance with what the documentation says. Question: what could be happening, and how could I check what's going on under the hood? BTW, from time to time it's not unusual to get a `503` error when navigating the console. Any help is greatly appreciated, thx!
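For reference, this is roughly the shape of that change: a hedged sketch of the flyte-core values fragment that enables the MPI plugin (key names follow the Flyte k8s-plugin docs; verify against your chart version before applying):

```yaml
# Sketch of the flyte-core values fragment that enables the MPI plugin.
configmap:
  enabled_plugins:
    tasks:
      task-plugins:
        enabled-plugins:
          - container
          - sidecar
          - k8s-array
          - mpi
        default-for-task-types:
          container: container
          sidecar: sidecar
          container_array: k8s-array
          mpi: mpi
```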
s
Hi @Gastón Ceccotti, were you able to get the workflows running successfully before you installed the MPI operator? Also, can you check the propeller logs?
g
Hi @Samhita Alla! Yes, there was no issue at all. Now I have managed to move forward and the jobs are submitting, but I'm getting the following error:
E0807 16:38:21.462968       1 workers.go:102] error syncing 'flytesnacks-development/fa6f1f44573a6451e9cb': failed at Node[n0]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [mpi]: found no current condition. Conditions: []
A job that is not an MPI job executes correctly; I'm only getting this error when submitting an MPI job. If you have any suggestions on what I could try, I'll really appreciate it 😄
btw, I'm now using the training-operator (that was the tweak that resolved the original issue, but now I'm having the issue that I mentioned in the above message 🥲)
s
Do you see any error on the UI? Also, have you followed the instructions outlined in https://docs.flyte.org/en/latest/deployment/plugins/k8s/index.html guide?
g
Hi! Yes, I've been following the instructions but with no luck 😮 Now I'm re-installing everything just to double-check and try a fresh start, but yeah, it seems something is escaping my eye. What are the probable causes of this type of error? I'm completely lost on this one.
s
Ah, I now recall coming across this error. Can you relaunch your job? Also, are you spinning up a demo cluster after updating the `~/flyte/sandbox/config.yaml` file with the relevant MPI configuration?
Sorry, you've deployed Flyte locally, correct?
cc @jeev
g
Hi! Sorry, went to grab something to eat. Yes, I did a local install of Flyte (I'm using a local Kubernetes cluster that gets instantiated by Docker Desktop).
As soon as I run the Helm update with the file containing the mpi plugin in the `enabled-plugins` value, the pods get restarted. If you still want me to, I can try relaunching the job once again, but I'm pretty sure the job is already being executed once the mpi plugin is enabled.
I'm still trying to find out what's going on when I enable the mpi plugin, but it got me thinking... I'm working on this from an M1 Mac, and earlier on I had to download and install the following dependencies:
Copy code
# I needed these in order to install Horovod and make it compile with cmake
brew install pkg-config libuv

# These to work with a workflow that has an MPI task
pip install flytekitplugins-kfmpi tensorflow cmake

# I'm guessing these were to install Horovod
HOROVOD_WITH_TENSORFLOW=1 pip install --no-cache-dir horovod kubeflow-training
I downloaded them when I was getting a different error (the job wasn't being submitted to Flyte; it was breaking with something like "...[dependency_name] couldn't be found..."), and once I installed them, I could submit the jobs to Flyte. I'm currently getting the error that I mentioned above:
E0807 16:38:21.462968       1 workers.go:102] error syncing 'flytesnacks-development/fa6f1f44573a6451e9cb': failed at Node[n0]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [mpi]: found no current condition. Conditions: []
s
@jeev, any idea what might be causing this issue? cc @Kevin Su
cc @Yubo Wang
y
was the kubeflow horovod operator correctly installed?
`found no current condition` usually means that the kubeflow api did not return a correct response
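To make that concrete, here is a minimal, hypothetical sketch (plain Python, not flytepropeller's actual code; the function name and dict shapes are illustrative) of the kind of check that produces this message: the plugin reads the MPIJob's `status.conditions` and fails when the list is empty.

```python
# Hypothetical sketch of the status check behind the error message; the
# function name and dict shapes are illustrative, not flytepropeller's code.

def current_condition(job_status):
    """Return the type of the most recent 'True' condition on an MPIJob."""
    conditions = job_status.get("conditions") or []
    current = [c for c in conditions if c.get("status") == "True"]
    if not current:
        # An operator that never updated the job's status leaves this empty,
        # which surfaces as: "found no current condition. Conditions: []"
        raise RuntimeError(f"found no current condition. Conditions: {conditions}")
    return current[-1]["type"]

# A healthy operator sets at least a "Created" condition right away:
healthy = {"conditions": [{"type": "Created", "status": "True",
                           "reason": "MPIJobCreated"}]}
print(current_condition(healthy))  # Created
```

So an empty `Conditions: []` means the operator never wrote any status back to the MPIJob, which is why the suspicion falls on the operator installation rather than the task code.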
g
Hi everyone! Hmm, that's a good question... I did struggle installing Horovod. Do you have any links explaining how to install it? I'll search for some and try them, and I'll post any news here.
s
Here's the Dockerfile I've used:
Copy code
FROM ubuntu:focal
LABEL org.opencontainers.image.source https://github.com/flyteorg/flytesnacks

WORKDIR /root
ENV VENV /opt/venv
ENV LANG C.UTF-8
ENV LC_ALL C.UTF-8
ENV PYTHONPATH /root
ENV DEBIAN_FRONTEND=noninteractive

# Install Python3 and other basics
RUN apt-get update \
    && apt-get install -y software-properties-common \
    && add-apt-repository ppa:ubuntu-toolchain-r/test \
    && add-apt-repository -y ppa:deadsnakes/ppa \
    && apt-get install -y \
    build-essential \
    cmake \
    g++-7 \
    curl \
    git \
    wget \
    python3.10 \
    python3.10-venv \
    python3.10-dev \
    make \
    libssl-dev \
    python3-pip \
    python3-wheel \
    libuv1

ENV VENV /opt/venv
# Virtual environment
RUN python3 -m venv ${VENV}
ENV PATH="${VENV}/bin:$PATH"

# Install wheel after venv is activated
RUN pip3 install wheel

# Install Open MPI
RUN wget --progress=dot:mega -O /tmp/openmpi-4.1.4-bin.tar.gz https://download.open-mpi.org/release/open-mpi/v4.1/openmpi-4.1.4.tar.gz && \
    cd /tmp && tar -zxf /tmp/openmpi-4.1.4-bin.tar.gz && \
    mkdir openmpi-4.1.4/build && cd openmpi-4.1.4/build && ../configure --prefix=/usr/local && \
    make -j all && make install && ldconfig && \
    mpirun --version

# Allow OpenSSH to talk to containers without asking for confirmation
RUN mkdir -p /var/run/sshd
RUN cat /etc/ssh/ssh_config | grep -v StrictHostKeyChecking > /etc/ssh/ssh_config.new && \
    echo "    StrictHostKeyChecking no" >> /etc/ssh/ssh_config.new && \
    mv /etc/ssh/ssh_config.new /etc/ssh/ssh_config

# Install Python dependencies
COPY requirements.in /root
RUN pip install -r /root/requirements.in

# Install TensorFlow
RUN wget https://tf.novaal.de/westmere/tensorflow-2.8.0-cp310-cp310-linux_x86_64.whl && pip install tensorflow-2.8.0-cp310-cp310-linux_x86_64.whl

# Enable GPU
# ENV HOROVOD_GPU_OPERATIONS NCCL
RUN HOROVOD_WITH_MPI=1 pip install --no-cache-dir horovod==0.28.1

# Copy the actual code
COPY . /root/

# This tag is supplied by the build script and will be used to determine the version
# when registering tasks, workflows, and launch plans
ARG tag
ENV FLYTE_INTERNAL_IMAGE $tag
I'm also importing `horovod` in the flyte task: https://gist.github.com/samhita-alla/8a83eaf8a6cc61d85301abc58242f939. Importing it in the flyte task means I needn't install `horovod` on my system.
g
I just finished re-installing Horovod using pip, and I'm still getting the same error. I did see some documentation mentioning creating a docker image, but I never quite understood how to use it (I mean, I didn't know how to make Flyte run my jobs on that docker image), so I moved on without creating one.
Okay, I think I found out how to make Flyte run my jobs using a specific docker image; I'm trying it now. I'll post any news here 😄
So... I managed to make the Flyte jobs execute using the same docker image that you sent me, but I still had no luck. The hello world example and the logistic regression with the wine dataset ran with no errors, but as soon as I try to execute the MPI job, it crashes with the same error in the flytepropeller logs:
RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [mpi]: found no current condition. Conditions: []
The job is the same one that appears in the docs; the only thing I changed was importing Horovod directly in the Flyte task (I really appreciate you mentioning that; without that change it was throwing an error, and I wouldn't have guessed it).
y
@Gastón Ceccotti can you describe the mpijob created?
I want to see if there is any mpijob created at all
kubectl get mpijob
g
I only had to tweak the following in the Dockerfile: adding `python3-venv` to the `apt-get install` list, and installing TensorFlow directly from pip (instead of using the whl directly).
Great, thx for the command. I'm going to execute it now and post the results here
Where should I run it? If I run it in my local environment it shows `No resources found in default namespace.`, and if I run it in the flytepropeller pod terminal it shows `/bin/sh: kubectl: not found`.
y
where do your tasks run?
try
kubectl get mpijob -n flytesnacks-development
g
Mmm, the workflows that ran OK were executed on a local k8s cluster that has Flyte installed on it
Ok, trying it now
It showed the following with that command:
Copy code
adz9c7r4w9lvj4phntxs-n0-0   17m     Created
f691283563e5044edb39-n0-3   22h     Failed
febd3eefb998145288f7-n0-3   4h40m   Failed
btw, the last MPI job that I tried to run was approximately 20 minutes ago, and it's still in `RUNNING` state
y
can you do a describe on that mpijob?
so the API calls are correct, but I suspect that you are using the mpi v2 kubeflow operator
g
Yes, sure
y
Copy code
kubectl get -o yaml mpijobs adz9c7r4w9lvj4phntxs-n0-0 -n flytesnacks-development
g
Trying it now
Ohhh that's great, I'm not much of an expert on k8s, thx for the command!
Okay, so this is the output of that command:
Copy code
kubectl get -o yaml mpijobs adz9c7r4w9lvj4phntxs-n0-0 -n flytesnacks-development
apiVersion: kubeflow.org/v1
kind: MPIJob
metadata:
  annotations:
    cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
  creationTimestamp: "2023-08-09T18:13:55Z"
  generation: 1
  labels:
    domain: development
    execution-id: adz9c7r4w9lvj4phntxs
    interruptible: "false"
    node-id: n0
    project: flytesnacks
    shard-key: "10"
    task-name: workflows-distributed-training-horovod-train-task
    workflow-name: workflows-distributed-training-horovod-training-wf
  name: adz9c7r4w9lvj4phntxs-n0-0
  namespace: flytesnacks-development
  ownerReferences:
  - apiVersion: flyte.lyft.com/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: flyteworkflow
    name: adz9c7r4w9lvj4phntxs
    uid: 7e8c648e-2ba5-4364-aca3-8643e7764bd3
  resourceVersion: "58239"
  uid: 51d3e0c2-1f46-4348-a5f2-8f58bbfa5c3e
spec:
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      restartPolicy: Never
      template:
        metadata: {}
        spec:
          affinity: {}
          containers:
          - args:
            - mpirun
            - --allow-run-as-root
            - -bind-to
            - none
            - -map-by
            - slot
            - -x
            - LD_LIBRARY_PATH
            - -x
            - PATH
            - -x
            - NCCL_DEBUG=INFO
            - -mca
            - pml
            - ob1
            - -mca
            - btl
            - ^openib
            - -np
            - "3"
            - python
            - /opt/venv/bin/entrypoint.py
            - pyflyte-execute
            - --inputs
            - s3://my-s3-bucket/metadata/propeller/flytesnacks-development-adz9c7r4w9lvj4phntxs/n0/data/inputs.pb
            - --output-prefix
            - s3://my-s3-bucket/metadata/propeller/flytesnacks-development-adz9c7r4w9lvj4phntxs/n0/data/0
            - --raw-output-data-prefix
            - s3://my-s3-bucket/q4/adz9c7r4w9lvj4phntxs-n0-0
            - --checkpoint-path
            - s3://my-s3-bucket/q4/adz9c7r4w9lvj4phntxs-n0-0/_flytecheckpoints
            - --prev-checkpoint
            - '""'
            - --resolver
            - flytekit.core.python_auto_container.default_task_resolver
            - --
            - task-module
            - workflows.distributed_training
            - task-name
            - horovod_train_task
            env:
            - name: FLYTE_INTERNAL_EXECUTION_WORKFLOW
              value: flytesnacks:development:workflows.distributed_training.horovod_training_wf
            - name: FLYTE_INTERNAL_EXECUTION_ID
              value: adz9c7r4w9lvj4phntxs
            - name: FLYTE_INTERNAL_EXECUTION_PROJECT
              value: flytesnacks
            - name: FLYTE_INTERNAL_EXECUTION_DOMAIN
              value: development
            - name: FLYTE_ATTEMPT_NUMBER
              value: "0"
            - name: FLYTE_INTERNAL_TASK_PROJECT
              value: flytesnacks
            - name: FLYTE_INTERNAL_TASK_DOMAIN
              value: development
            - name: FLYTE_INTERNAL_TASK_NAME
              value: workflows.distributed_training.horovod_train_task
            - name: FLYTE_INTERNAL_TASK_VERSION
              value: HEAD
            - name: FLYTE_INTERNAL_PROJECT
              value: flytesnacks
            - name: FLYTE_INTERNAL_DOMAIN
              value: development
            - name: FLYTE_INTERNAL_NAME
              value: workflows.distributed_training.horovod_train_task
            - name: FLYTE_INTERNAL_VERSION
              value: HEAD
            - name: FLYTE_AWS_ENDPOINT
              value: http://minio.flyte.svc.cluster.local:9000
            - name: FLYTE_AWS_ACCESS_KEY_ID
              value: minio
            - name: FLYTE_AWS_SECRET_ACCESS_KEY
              value: miniostorage
            image: piloto_mpi:piloto
            name: mpi
            resources:
              limits:
                cpu: 500m
                memory: 1Gi
              requests:
                cpu: 500m
                memory: 1Gi
            terminationMessagePolicy: FallbackToLogsOnError
          restartPolicy: Never
    Worker:
      replicas: 3
      restartPolicy: Never
      template:
        metadata: {}
        spec:
          affinity: {}
          containers:
          - args:
            - mpirun
            - --allow-run-as-root
            - -bind-to
            - none
            - -map-by
            - slot
            - -x
            - LD_LIBRARY_PATH
            - -x
            - PATH
            - -x
            - NCCL_DEBUG=INFO
            - -mca
            - pml
            - ob1
            - -mca
            - btl
            - ^openib
            - -np
            - "3"
            - python
            - /opt/venv/bin/entrypoint.py
            - pyflyte-execute
            - --inputs
            - s3://my-s3-bucket/metadata/propeller/flytesnacks-development-adz9c7r4w9lvj4phntxs/n0/data/inputs.pb
            - --output-prefix
            - s3://my-s3-bucket/metadata/propeller/flytesnacks-development-adz9c7r4w9lvj4phntxs/n0/data/0
            - --raw-output-data-prefix
            - s3://my-s3-bucket/q4/adz9c7r4w9lvj4phntxs-n0-0
            - --checkpoint-path
            - s3://my-s3-bucket/q4/adz9c7r4w9lvj4phntxs-n0-0/_flytecheckpoints
            - --prev-checkpoint
            - '""'
            - --resolver
            - flytekit.core.python_auto_container.default_task_resolver
            - --
            - task-module
            - workflows.distributed_training
            - task-name
            - horovod_train_task
            env:
            - name: FLYTE_INTERNAL_EXECUTION_WORKFLOW
              value: flytesnacks:development:workflows.distributed_training.horovod_training_wf
            - name: FLYTE_INTERNAL_EXECUTION_ID
              value: adz9c7r4w9lvj4phntxs
            - name: FLYTE_INTERNAL_EXECUTION_PROJECT
              value: flytesnacks
            - name: FLYTE_INTERNAL_EXECUTION_DOMAIN
              value: development
            - name: FLYTE_ATTEMPT_NUMBER
              value: "0"
            - name: FLYTE_INTERNAL_TASK_PROJECT
              value: flytesnacks
            - name: FLYTE_INTERNAL_TASK_DOMAIN
              value: development
            - name: FLYTE_INTERNAL_TASK_NAME
              value: workflows.distributed_training.horovod_train_task
            - name: FLYTE_INTERNAL_TASK_VERSION
              value: HEAD
            - name: FLYTE_INTERNAL_PROJECT
              value: flytesnacks
            - name: FLYTE_INTERNAL_DOMAIN
              value: development
            - name: FLYTE_INTERNAL_NAME
              value: workflows.distributed_training.horovod_train_task
            - name: FLYTE_INTERNAL_VERSION
              value: HEAD
            - name: FLYTE_AWS_ENDPOINT
              value: http://minio.flyte.svc.cluster.local:9000
            - name: FLYTE_AWS_ACCESS_KEY_ID
              value: minio
            - name: FLYTE_AWS_SECRET_ACCESS_KEY
              value: miniostorage
            image: piloto_mpi:piloto
            name: mpi
            resources:
              limits:
                cpu: 500m
                memory: 1Gi
              requests:
                cpu: 500m
                memory: 1Gi
            terminationMessagePolicy: FallbackToLogsOnError
          restartPolicy: Never
  runPolicy: {}
  slotsPerWorker: 1
status:
  conditions:
  - lastTransitionTime: "2023-08-09T18:13:55Z"
    lastUpdateTime: "2023-08-09T18:13:55Z"
    message: MPIJob flytesnacks-development/adz9c7r4w9lvj4phntxs-n0-0 is created.
    reason: MPIJobCreated
    status: "True"
    type: Created
  replicaStatuses:
    Launcher: {}
    Worker:
      active: 3
  startTime: "2023-08-09T18:13:55Z"
y
can you do
k get pods -n flytesnacks-development
I think it is due to launcher failure
g
Ohh okey, trying it now
This is the output:
Copy code
kubectl get pods -n flytesnacks-development
NAME                                 READY   STATUS      RESTARTS   AGE
adrfzmnrchw57npfkcj8-n0-0            0/1     Completed   0          36m
adrfzmnrchw57npfkcj8-n1-0            0/1     Completed   0          36m
adrfzmnrchw57npfkcj8-n2-0            0/1     Completed   0          36m
adz9c7r4w9lvj4phntxs-n0-0-launcher   0/1     Pending     0          34m
adz9c7r4w9lvj4phntxs-n0-0-worker-0   1/1     Running     0          34m
adz9c7r4w9lvj4phntxs-n0-0-worker-1   1/1     Running     0          34m
adz9c7r4w9lvj4phntxs-n0-0-worker-2   1/1     Running     0          34m
ak9m6bcvjcvx9vbr8t9z-n0-0            0/1     Completed   0          38m
ak9m6bcvjcvx9vbr8t9z-n1-0            0/1     Completed   0          38m
f691283563e5044edb39-n0-3-launcher   0/1     Error       0          22h
f691283563e5044edb39-n0-3-worker-0   0/1     Error       0          22h
f691283563e5044edb39-n0-3-worker-1   0/1     Error       0          22h
f691283563e5044edb39-n0-3-worker-2   0/1     Error       0          22h
f6f8b932f56c54fbdbe2-n0-0            0/1     Completed   0          22h
f6f8b932f56c54fbdbe2-n1-0            0/1     Completed   0          22h
f6f8b932f56c54fbdbe2-n2-0            0/1     Completed   0          22h
f80e038fa12f44843812-n0-0            0/1     Completed   0          5h1m
f80e038fa12f44843812-n1-0            0/1     Completed   0          5h1m
f80e038fa12f44843812-n2-0            0/1     Completed   0          5h
febd3eefb998145288f7-n0-3-launcher   0/1     Error       0          4h58m
febd3eefb998145288f7-n0-3-worker-0   1/1     Running     0          4h58m
febd3eefb998145288f7-n0-3-worker-1   1/1     Running     0          4h58m
febd3eefb998145288f7-n0-3-worker-2   1/1     Running     0          4h58m
I'm guessing you were correct about the launcher failure?
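The pod list above points the same way: the launcher pods are `Pending` or `Error` while the workers run. The MPIJob's `status.replicaStatuses` block from earlier shows the matching pattern (an empty `Launcher: {}` next to `Worker: active: 3`). A tiny illustrative check of that pattern in plain Python (the function name is made up; this is not operator or propeller code):

```python
# Illustrative only: detect the "workers running, launcher never started"
# pattern from an MPIJob's status.replicaStatuses. Not real operator code.

def launcher_stuck(replica_statuses):
    launcher = replica_statuses.get("Launcher") or {}
    workers = replica_statuses.get("Worker") or {}
    # No active/succeeded/failed counts on the launcher, but active workers:
    # the launcher pod is likely Pending or crashed before it could start.
    return not any(launcher.values()) and workers.get("active", 0) > 0

# Mirrors the status block from the mpijob described earlier:
status = {"Launcher": {}, "Worker": {"active": 3}}
print(launcher_stuck(status))  # True
```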
y
try
kubectl logs f691283563e5044edb39-n0-3-launcher -n flytesnacks-development
g
Okok, trying it now
This is the output shown:
Copy code
kubectl logs f691283563e5044edb39-n0-3-launcher -n flytesnacks-development
Defaulted container "mpi" out of: mpi, kubectl-delivery (init)
unable to retrieve container logs for docker://078ad2551be44d24c5c0050a75bd2b03a14c42fa07a7cc1a4adc401c6a0d3850
y
kubectl describe pod f691283563e5044edb39-n0-3-launcher -n flytesnacks-development
g
This should be the output:
Copy code
kubectl describe pod f691283563e5044edb39-n0-3-launcher -n flytesnacks-development
Name:             f691283563e5044edb39-n0-3-launcher
Namespace:        flytesnacks-development
Priority:         0
Service Account:  f691283563e5044edb39-n0-3-launcher
Node:             docker-desktop/192.168.65.4
Start Time:       Tue, 08 Aug 2023 17:11:09 -0300
Labels:           training.kubeflow.org/job-name=f691283563e5044edb39-n0-3
                  training.kubeflow.org/job-role=master
                  training.kubeflow.org/operator-name=mpijob-controller
                  training.kubeflow.org/replica-type=launcher
Annotations:      <none>
Status:           Failed
IP:               10.1.0.83
IPs:
  IP:           10.1.0.83
Controlled By:  MPIJob/f691283563e5044edb39-n0-3
Init Containers:
  kubectl-delivery:
    Container ID:   docker://dee3fc522e997f022eb4d76a024fabad8c33f04716472350eb2a5eb4039e8a60
    Image:          mpioperator/kubectl-delivery:latest
    Image ID:       docker-pullable://mpioperator/kubectl-delivery@sha256:8a4a24114e0bdc8df8f44e657baa6f5d47b24b1664b26c6f59e06575f8f21a55
    Port:           <none>
    Host Port:      <none>
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Tue, 08 Aug 2023 17:11:10 -0300
      Finished:     Tue, 08 Aug 2023 17:11:17 -0300
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:                100m
      ephemeral-storage:  5Gi
      memory:             512Mi
    Requests:
      cpu:                100m
      ephemeral-storage:  5Gi
      memory:             512Mi
    Environment:
      TARGET_DIR:  /opt/kube
      NAMESPACE:   flytesnacks-development
    Mounts:
      /etc/mpi from mpi-job-config (rw)
      /opt/kube from mpi-job-kubectl (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-z4whm (ro)
Containers:
  mpi:
    Container ID:  docker://078ad2551be44d24c5c0050a75bd2b03a14c42fa07a7cc1a4adc401c6a0d3850
    Image:         cr.flyte.org/flyteorg/flytekit:py3.11-1.8.1
    Image ID:      docker-pullable://cr.flyte.org/flyteorg/flytekit@sha256:07e13d5a3f49b918dcc323a1cb6f01c455b0c71fb46d784b3b958ba919afcc62
    Port:          <none>
    Host Port:     <none>
    Args:
      pyflyte-fast-execute
      --additional-distribution
      s3://my-s3-bucket/flytesnacks/development/A6JBX2NT37TMAF76N7B4ICKM6I======/script_mode.tar.gz
      --dest-dir
      /root
      --
      mpirun
      --allow-run-as-root
      -bind-to
      none
      -map-by
      slot
      -x
      LD_LIBRARY_PATH
      -x
      PATH
      -x
      NCCL_DEBUG=INFO
      -mca
      pml
      ob1
      -mca
      btl
      ^openib
      -np
      3
      python
      /opt/venv/bin/entrypoint.py
      pyflyte-execute
      --inputs
      s3://my-s3-bucket/metadata/propeller/flytesnacks-development-f691283563e5044edb39/n0/data/inputs.pb
      --output-prefix
      s3://my-s3-bucket/metadata/propeller/flytesnacks-development-f691283563e5044edb39/n0/data/3
      --raw-output-data-prefix
      s3://my-s3-bucket/54/f691283563e5044edb39-n0-3
      --checkpoint-path
      s3://my-s3-bucket/54/f691283563e5044edb39-n0-3/_flytecheckpoints
      --prev-checkpoint
      s3://my-s3-bucket/u3/f691283563e5044edb39-n0-2/_flytecheckpoints
      --resolver
      flytekit.core.python_auto_container.default_task_resolver
      --
      task-module
      distributed_training
      task-name
      horovod_train_task
    State:      Terminated
      Reason:   Error
      Message:  /usr/local/lib/python3.11/site-packages/click/core.py:783 in invoke
                    return __callback(*args, **kwargs)
                /usr/local/lib/python3.11/site-packages/flytekit/bin/entrypoint.py:517 in fast_execute_task_cmd
                    p = subprocess.run(cmd, check=False)
                /usr/local/lib/python3.11/subprocess.py:548 in run
                    with Popen(*popenargs, **kwargs) as process:
                /usr/local/lib/python3.11/subprocess.py:1026 in __init__
                    self._execute_child(args, executable, preexec_fn, close_f...
                /usr/local/lib/python3.11/subprocess.py:1950 in _execute_child
                    raise child_exception_type(errno_num, err_msg, er...
FileNotFoundError: [Errno 2] No such file or directory: 'mpirun'

      Exit Code:    1
      Started:      Tue, 08 Aug 2023 17:11:17 -0300
      Finished:     Tue, 08 Aug 2023 17:11:19 -0300
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     500m
      memory:  1Gi
    Requests:
      cpu:     500m
      memory:  1Gi
    Environment:
      FLYTE_INTERNAL_EXECUTION_WORKFLOW:  flytesnacks:development:distributed_training.horovod_training_wf
      FLYTE_INTERNAL_EXECUTION_ID:        f691283563e5044edb39
      FLYTE_INTERNAL_EXECUTION_PROJECT:   flytesnacks
      FLYTE_INTERNAL_EXECUTION_DOMAIN:    development
      FLYTE_ATTEMPT_NUMBER:               3
      FLYTE_INTERNAL_TASK_PROJECT:        flytesnacks
      FLYTE_INTERNAL_TASK_DOMAIN:         development
      FLYTE_INTERNAL_TASK_NAME:           distributed_training.horovod_train_task
      FLYTE_INTERNAL_TASK_VERSION:        c1os32vnshq5Bl9727lWxA==
      FLYTE_INTERNAL_PROJECT:             flytesnacks
      FLYTE_INTERNAL_DOMAIN:              development
      FLYTE_INTERNAL_NAME:                distributed_training.horovod_train_task
      FLYTE_INTERNAL_VERSION:             c1os32vnshq5Bl9727lWxA==
      FLYTE_AWS_ACCESS_KEY_ID:            minio
      FLYTE_AWS_SECRET_ACCESS_KEY:        miniostorage
      FLYTE_AWS_ENDPOINT:                 http://minio.flyte.svc.cluster.local:9000
      OMPI_MCA_plm_rsh_agent:             /etc/mpi/kubexec.sh
      OMPI_MCA_orte_default_hostfile:     /etc/mpi/hostfile
      NVIDIA_VISIBLE_DEVICES:             
      NVIDIA_DRIVER_CAPABILITIES:         
    Mounts:
      /etc/mpi from mpi-job-config (rw)
      /opt/kube from mpi-job-kubectl (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-z4whm (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  mpi-job-kubectl:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  mpi-job-config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      f691283563e5044edb39-n0-3-config
    Optional:  false
  kube-api-access-z4whm:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Guaranteed
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:                      <none>
y
Copy code
cr.flyte.org/flyteorg/flytekit:py3.11-1.8.1
probably does not have mpi installed correctly
g
Mmm, not sure why it says `FileNotFoundError: [Errno 2] No such file or directory: 'mpirun'` even though the image has an MPI installation. I mean, it works with the one that I created.
Hmm, how could we check that?
Or how could I install it correctly? I thought that using the custom docker image was going to do the trick.
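For what it's worth, that `FileNotFoundError` is exactly what Python's `subprocess` raises when the executable isn't on `PATH` inside the container: the launcher pod ran `mpirun ...` from an image without Open MPI. A quick stdlib sketch of both the failure mode and a preflight check (`mpirun` is the real binary name from the traceback; the rest is illustrative):

```python
import shutil
import subprocess

# The launcher container ultimately does subprocess.run(["mpirun", ...]);
# if the image has no mpirun on PATH, Popen raises FileNotFoundError
# before the job does any work, matching the traceback above.

def image_can_launch(binary: str = "mpirun") -> bool:
    """Cheap preflight check you could run inside the container."""
    return shutil.which(binary) is not None

# Reproducing the failure mode with a deliberately nonexistent binary:
try:
    subprocess.run(["no-such-binary-here-xyz"], check=False)
except FileNotFoundError as exc:
    print(f"launcher would die with: {exc}")
```

Inside the running image, `which mpirun` (or the check above) quickly tells you whether the image the launcher actually got is the one you built.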
y
oh wait
I lost track
did you build your own image?
g
haha no trouble at all
Yeap
y
so your task should be using the new image
@task(image=<your_new_image>)
you need to specify it
g
I tried running this workflow using the docker image that gets created with the Dockerfile that is in this thread 😄
OHHHHHHHHHH
Okok, let me try that out and I'll get back to you
If that does the trick, I promise you a bottle of wine and some asado if you ever come to Argentina hahaha
y
but regardless, I think we have some issues with the failure-handling mechanism; I'd assume that when the launcher fails, the whole job should fail. Let me do some investigation later
yeah just keep me posted with the updates
g
Yeap, no problem
Just to check... could it be that instead of `@task(image="<image_name>")`, it is now `@task(container_image="<image_name>")`?
y
container_image is correct
sorry, I was just trying to recall it off the top of my head, so it can be inaccurate
g
No problem at all! Ok, let me just build the package and register it and I'll be back with the results, fingers crossed!
y
one thing to add: I don't know your setup, but you probably want to push your image somewhere.
are you running on minikube or something?
g
Yeap, I'm using Docker Desktop and the Kubernetes that comes with it
I'm running everything on my Mac; I wanted to get everything working locally before taking it to the definitive k8s cluster (somewhat like a POC)
y
I am not sure about "Docker Desktop and the Kubernetes that comes with it"
how did you setup your k8s cluster?
g
I managed to register it. If it says that the workflows already exist... does that mean they got uploaded anyway? Or were they just ignored?
y
try registering with a newer version
g
Ok, I'll try it now
Coming back to your question: I installed docker (and Docker Desktop) on my notebook, and Docker Desktop has some settings you can tweak; among them there's an option that says "Enable Kubernetes"
A few minutes go by and you have a k8s cluster running locally (I'm working with the default cluster)
y
oh wow, that is something I just learnt. thanks for it
g
Yeah, sure! No trouble at all! I had some really good luck when I found that out; it really saved me haha
So, right now it shows as running, but I'm not sure how to give you more info
y
you can check if `describe pod` gives you the correct image you specified
and you can check the launcher pod's logs
g
Oh that's nice, ok I'll try that
It shows the following:
Copy code
kubectl get -o yaml mpijobs a7nfz7zcpnczvmnhqt6s-n0-0 -n flytesnacks-development
apiVersion: kubeflow.org/v1
kind: MPIJob
metadata:
  annotations:
    cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
  creationTimestamp: "2023-08-09T19:18:13Z"
  generation: 1
  labels:
    domain: development
    execution-id: a7nfz7zcpnczvmnhqt6s
    interruptible: "false"
    node-id: n0
    project: flytesnacks
    shard-key: "3"
    task-name: workflows-distributed-training-horovod-train-task
    workflow-name: workflows-distributed-training-horovod-training-wf
  name: a7nfz7zcpnczvmnhqt6s-n0-0
  namespace: flytesnacks-development
  ownerReferences:
  - apiVersion: flyte.lyft.com/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: flyteworkflow
    name: a7nfz7zcpnczvmnhqt6s
    uid: cef3f28b-bdf3-471a-bac3-0ace667f44d6
  resourceVersion: "70681"
  uid: 7d3e4d21-bd7d-424e-b0d4-af808d238d0c
spec:
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      restartPolicy: Never
      template:
        metadata: {}
        spec:
          affinity: {}
          containers:
          - args:
            - mpirun
            - --allow-run-as-root
            - -bind-to
            - none
            - -map-by
            - slot
            - -x
            - LD_LIBRARY_PATH
            - -x
            - PATH
            - -x
            - NCCL_DEBUG=INFO
            - -mca
            - pml
            - ob1
            - -mca
            - btl
            - ^openib
            - -np
            - "3"
            - python
            - /opt/venv/bin/entrypoint.py
            - pyflyte-execute
            - --inputs
            - s3://my-s3-bucket/metadata/propeller/flytesnacks-development-a7nfz7zcpnczvmnhqt6s/n0/data/inputs.pb
            - --output-prefix
            - s3://my-s3-bucket/metadata/propeller/flytesnacks-development-a7nfz7zcpnczvmnhqt6s/n0/data/0
            - --raw-output-data-prefix
            - s3://my-s3-bucket/ar/a7nfz7zcpnczvmnhqt6s-n0-0
            - --checkpoint-path
            - s3://my-s3-bucket/ar/a7nfz7zcpnczvmnhqt6s-n0-0/_flytecheckpoints
            - --prev-checkpoint
            - '""'
            - --resolver
            - flytekit.core.python_auto_container.default_task_resolver
            - --
            - task-module
            - workflows.distributed_training
            - task-name
            - horovod_train_task
            env:
            - name: FLYTE_INTERNAL_EXECUTION_WORKFLOW
              value: flytesnacks:development:workflows.distributed_training.horovod_training_wf
            - name: FLYTE_INTERNAL_EXECUTION_ID
              value: a7nfz7zcpnczvmnhqt6s
            - name: FLYTE_INTERNAL_EXECUTION_PROJECT
              value: flytesnacks
            - name: FLYTE_INTERNAL_EXECUTION_DOMAIN
              value: development
            - name: FLYTE_ATTEMPT_NUMBER
              value: "0"
            - name: FLYTE_INTERNAL_TASK_PROJECT
              value: flytesnacks
            - name: FLYTE_INTERNAL_TASK_DOMAIN
              value: development
            - name: FLYTE_INTERNAL_TASK_NAME
              value: workflows.distributed_training.horovod_train_task
            - name: FLYTE_INTERNAL_TASK_VERSION
              value: "2"
            - name: FLYTE_INTERNAL_PROJECT
              value: flytesnacks
            - name: FLYTE_INTERNAL_DOMAIN
              value: development
            - name: FLYTE_INTERNAL_NAME
              value: workflows.distributed_training.horovod_train_task
            - name: FLYTE_INTERNAL_VERSION
              value: "2"
            - name: FLYTE_AWS_SECRET_ACCESS_KEY
              value: miniostorage
            - name: FLYTE_AWS_ENDPOINT
              value: http://minio.flyte.svc.cluster.local:9000
            - name: FLYTE_AWS_ACCESS_KEY_ID
              value: minio
            image: piloto_mpi:piloto
            name: mpi
            resources:
              limits:
                cpu: 500m
                memory: 1Gi
              requests:
                cpu: 500m
                memory: 1Gi
            terminationMessagePolicy: FallbackToLogsOnError
          restartPolicy: Never
    Worker:
      replicas: 3
      restartPolicy: Never
      template:
        metadata: {}
        spec:
          affinity: {}
          containers:
          - args:
            - mpirun
            - --allow-run-as-root
            - -bind-to
            - none
            - -map-by
            - slot
            - -x
            - LD_LIBRARY_PATH
            - -x
            - PATH
            - -x
            - NCCL_DEBUG=INFO
            - -mca
            - pml
            - ob1
            - -mca
            - btl
            - ^openib
            - -np
            - "3"
            - python
            - /opt/venv/bin/entrypoint.py
            - pyflyte-execute
            - --inputs
            - s3://my-s3-bucket/metadata/propeller/flytesnacks-development-a7nfz7zcpnczvmnhqt6s/n0/data/inputs.pb
            - --output-prefix
            - s3://my-s3-bucket/metadata/propeller/flytesnacks-development-a7nfz7zcpnczvmnhqt6s/n0/data/0
            - --raw-output-data-prefix
            - s3://my-s3-bucket/ar/a7nfz7zcpnczvmnhqt6s-n0-0
            - --checkpoint-path
            - s3://my-s3-bucket/ar/a7nfz7zcpnczvmnhqt6s-n0-0/_flytecheckpoints
            - --prev-checkpoint
            - '""'
            - --resolver
            - flytekit.core.python_auto_container.default_task_resolver
            - --
            - task-module
            - workflows.distributed_training
            - task-name
            - horovod_train_task
            env:
            - name: FLYTE_INTERNAL_EXECUTION_WORKFLOW
              value: flytesnacks:development:workflows.distributed_training.horovod_training_wf
            - name: FLYTE_INTERNAL_EXECUTION_ID
              value: a7nfz7zcpnczvmnhqt6s
            - name: FLYTE_INTERNAL_EXECUTION_PROJECT
              value: flytesnacks
            - name: FLYTE_INTERNAL_EXECUTION_DOMAIN
              value: development
            - name: FLYTE_ATTEMPT_NUMBER
              value: "0"
            - name: FLYTE_INTERNAL_TASK_PROJECT
              value: flytesnacks
            - name: FLYTE_INTERNAL_TASK_DOMAIN
              value: development
            - name: FLYTE_INTERNAL_TASK_NAME
              value: workflows.distributed_training.horovod_train_task
            - name: FLYTE_INTERNAL_TASK_VERSION
              value: "2"
            - name: FLYTE_INTERNAL_PROJECT
              value: flytesnacks
            - name: FLYTE_INTERNAL_DOMAIN
              value: development
            - name: FLYTE_INTERNAL_NAME
              value: workflows.distributed_training.horovod_train_task
            - name: FLYTE_INTERNAL_VERSION
              value: "2"
            - name: FLYTE_AWS_SECRET_ACCESS_KEY
              value: miniostorage
            - name: FLYTE_AWS_ENDPOINT
              value: http://minio.flyte.svc.cluster.local:9000
            - name: FLYTE_AWS_ACCESS_KEY_ID
              value: minio
            image: piloto_mpi:piloto
            name: mpi
            resources:
              limits:
                cpu: 500m
                memory: 1Gi
              requests:
                cpu: 500m
                memory: 1Gi
            terminationMessagePolicy: FallbackToLogsOnError
          restartPolicy: Never
  runPolicy: {}
  slotsPerWorker: 1
status:
  conditions:
  - lastTransitionTime: "2023-08-09T19:18:13Z"
    lastUpdateTime: "2023-08-09T19:18:13Z"
    message: MPIJob flytesnacks-development/a7nfz7zcpnczvmnhqt6s-n0-0 is created.
    reason: MPIJobCreated
    status: "True"
    type: Created
  replicaStatuses:
    Launcher: {}
    Worker:
      active: 3
  startTime: "2023-08-09T19:18:13Z"
But now I was checking the last output of this command and it also stated that the image was
piloto_mpi:piloto
y
piloto_mpi:piloto
should be the image you built, right?
g
When I create the package, I execute this command:
pyflyte --pkgs workflows package --image piloto_mpi:piloto
y
that is correct
g
Yeap, that would be the image
y
what's the issue now?
g
I'm not sure if it's executing properly. I mean, in the propeller logs I see the following
E0809 19:18:13.614887       1 workers.go:102] error syncing 'flytesnacks-development/a7nfz7zcpnczvmnhqt6s': failed at Node[n0]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [mpi]: found no current condition. Conditions: []
and when I look at the execution time of the job on the Flyte console.. I see that same exact time on the execution page of this workflow
The pods don't seem to be in error, but the launcher is still Pending:
kubectl get pods -n flytesnacks-development                                     
NAME                                 READY   STATUS      RESTARTS   AGE
a7nfz7zcpnczvmnhqt6s-n0-0-launcher   0/1     Pending     0          15m
a7nfz7zcpnczvmnhqt6s-n0-0-worker-0   1/1     Running     0          15m
a7nfz7zcpnczvmnhqt6s-n0-0-worker-1   1/1     Running     0          15m
a7nfz7zcpnczvmnhqt6s-n0-0-worker-2   1/1     Running     0          15m
y
Can you describe that pending pod?
I think it's an out-of-resources issue
Try deleting all the running worker pods that were supposed to be killed
g
Yeah, sure! Brb with the results of the describe
OMG, you are greattttt, yeah! You were right about the resources issue:
Events:
  Type     Reason            Age               From               Message
  ----     ------            ----              ----               -------
  Warning  FailedScheduling  8s (x5 over 20m)  default-scheduler  0/1 nodes are available: 1 Insufficient memory. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod.
How did you know? Hahaha
And.. a more important question, how do we fix it? 😂
y
I've seen all these errors before 😂
delete all the pods that are running
g
The launcher and the worker pods? Or only the launcher? Or.. do you mean, like.. every pod
y
I mean the running worker pods along with the failed launcher pod
g
Great… hmm, just out of curiosity, do you know how to do that? I'll google it otherwise, no biggie
Haha
y
let me give you a command real quick
best to do now is probably delete all pods in that namespace and relaunch your task
g
Thxx! I'm more used to using k8s from Rancher, where I can simply kill a pod by entering the deployment screen and reducing the instances of that deployment, but no idea how to do that with commands
Okey, great
if I terminate the job.. that should kill them, right?
y
kubectl delete --all pods -n flytesnacks-development
try this
g
And also..
requests=Resources(cpu="1", mem="2000Mi"),
I had this commented on the task annotation, if I uncomment it.. would that be enough? Or should I give it some more?
btw, my notebook has 8 cpu and 16GB of RAM
Ok, thx! Trying that command now
y
if you comment it out, it will use flyte default resource, which is like 500Mi
I doubt if that can even launch your image
g
Haha okok, would you say that those specs should be enough? I mean
requests=Resources(cpu="1", mem="2000Mi")
btw, the pods are all dead now. If we are okey with those specs I'll make a new package and relaunch the job 🙌
y
that should be enough
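(For reference, the scheduling arithmetic behind that "Insufficient memory" event can be sketched in plain Python. The request quantities below come from the MPIJob spec earlier in the thread, 1 launcher + 3 workers at 500m CPU / 1Gi each; the parsing helpers are hypothetical, not a Kubernetes API.)

```python
# Sketch of the scheduling arithmetic behind the "Insufficient memory" event.
# Kubernetes quantity conventions: "500m" = 0.5 CPU cores; "Mi"/"Gi" are
# binary megabytes/gigabytes (powers of 1024).

def parse_cpu(q: str) -> float:
    """Parse a Kubernetes CPU quantity ("500m", "2") into cores."""
    return float(q[:-1]) / 1000 if q.endswith("m") else float(q)

def parse_mem_mib(q: str) -> float:
    """Parse a Kubernetes memory quantity ("512Mi", "1Gi") into MiB."""
    for suffix, factor in (("Gi", 1024), ("Mi", 1)):
        if q.endswith(suffix):
            return float(q[: -len(suffix)]) * factor
    raise ValueError(f"unsupported quantity: {q}")

# Requests from the MPIJob spec above: 1 launcher + 3 workers, each asking
# for 500m CPU and 1Gi memory.
replicas = [("launcher", "500m", "1Gi")] + [(f"worker-{i}", "500m", "1Gi") for i in range(3)]
total_cpu = sum(parse_cpu(cpu) for _, cpu, _ in replicas)
total_mem_mib = sum(parse_mem_mib(mem) for _, _, mem in replicas)
print(f"job requests: {total_cpu} cores, {total_mem_mib:.0f} MiB")  # job requests: 2.0 cores, 4096 MiB
```

On a single-node Docker Desktop cluster the VM's allocatable memory also has to cover the Flyte control-plane pods, so four 1Gi requests can already push past what the scheduler has left, which matches the FailedScheduling message.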
g
Okey great! Fingers crossed!
Hmm.. it still says that it has insufficient memory
the describe shows the following:
kubectl describe pod ahj2xvlznbr6t9knh79z-n0-0-launcher -n flytesnacks-development
Name:             ahj2xvlznbr6t9knh79z-n0-0-launcher
Namespace:        flytesnacks-development
Priority:         0
Service Account:  ahj2xvlznbr6t9knh79z-n0-0-launcher
Node:             <none>
Labels:           training.kubeflow.org/job-name=ahj2xvlznbr6t9knh79z-n0-0
                  training.kubeflow.org/job-role=master
                  training.kubeflow.org/operator-name=mpijob-controller
                  training.kubeflow.org/replica-type=launcher
Annotations:      <none>
Status:           Pending
IP:               
IPs:              <none>
Controlled By:    MPIJob/ahj2xvlznbr6t9knh79z-n0-0
Init Containers:
  kubectl-delivery:
    Image:      mpioperator/kubectl-delivery:latest
    Port:       <none>
    Host Port:  <none>
    Limits:
      cpu:                100m
      ephemeral-storage:  5Gi
      memory:             512Mi
    Requests:
      cpu:                100m
      ephemeral-storage:  5Gi
      memory:             512Mi
    Environment:
      TARGET_DIR:  /opt/kube
      NAMESPACE:   flytesnacks-development
    Mounts:
      /etc/mpi from mpi-job-config (rw)
      /opt/kube from mpi-job-kubectl (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-2bp5h (ro)
Containers:
  mpi:
    Image:      piloto_mpi:piloto
    Port:       <none>
    Host Port:  <none>
    Args:
      mpirun
      --allow-run-as-root
      -bind-to
      none
      -map-by
      slot
      -x
      LD_LIBRARY_PATH
      -x
      PATH
      -x
      NCCL_DEBUG=INFO
      -mca
      pml
      ob1
      -mca
      btl
      ^openib
      -np
      3
      python
      /opt/venv/bin/entrypoint.py
      pyflyte-execute
      --inputs
      s3://my-s3-bucket/metadata/propeller/flytesnacks-development-ahj2xvlznbr6t9knh79z/n0/data/inputs.pb
      --output-prefix
      s3://my-s3-bucket/metadata/propeller/flytesnacks-development-ahj2xvlznbr6t9knh79z/n0/data/0
      --raw-output-data-prefix
      s3://my-s3-bucket/im/ahj2xvlznbr6t9knh79z-n0-0
      --checkpoint-path
      s3://my-s3-bucket/im/ahj2xvlznbr6t9knh79z-n0-0/_flytecheckpoints
      --prev-checkpoint
      ""
      --resolver
      flytekit.core.python_auto_container.default_task_resolver
      --
      task-module
      workflows.distributed_training
      task-name
      horovod_train_task
    Limits:
      cpu:     1
      memory:  2000Mi
    Requests:
      cpu:     1
      memory:  2000Mi
    Environment:
      FLYTE_INTERNAL_EXECUTION_WORKFLOW:  flytesnacks:development:workflows.distributed_training.horovod_training_wf
      FLYTE_INTERNAL_EXECUTION_ID:        ahj2xvlznbr6t9knh79z
      FLYTE_INTERNAL_EXECUTION_PROJECT:   flytesnacks
      FLYTE_INTERNAL_EXECUTION_DOMAIN:    development
      FLYTE_ATTEMPT_NUMBER:               0
      FLYTE_INTERNAL_TASK_PROJECT:        flytesnacks
      FLYTE_INTERNAL_TASK_DOMAIN:         development
      FLYTE_INTERNAL_TASK_NAME:           workflows.distributed_training.horovod_train_task
      FLYTE_INTERNAL_TASK_VERSION:        3
      FLYTE_INTERNAL_PROJECT:             flytesnacks
      FLYTE_INTERNAL_DOMAIN:              development
      FLYTE_INTERNAL_NAME:                workflows.distributed_training.horovod_train_task
      FLYTE_INTERNAL_VERSION:             3
      FLYTE_AWS_ENDPOINT:                 http://minio.flyte.svc.cluster.local:9000
      FLYTE_AWS_ACCESS_KEY_ID:            minio
      FLYTE_AWS_SECRET_ACCESS_KEY:        miniostorage
      OMPI_MCA_plm_rsh_agent:             /etc/mpi/kubexec.sh
      OMPI_MCA_orte_default_hostfile:     /etc/mpi/hostfile
      NVIDIA_VISIBLE_DEVICES:             
      NVIDIA_DRIVER_CAPABILITIES:         
    Mounts:
      /etc/mpi from mpi-job-config (rw)
      /opt/kube from mpi-job-kubectl (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-2bp5h (ro)
Conditions:
  Type           Status
  PodScheduled   False 
Volumes:
  mpi-job-kubectl:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  mpi-job-config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      ahj2xvlznbr6t9knh79z-n0-0-config
    Optional:  false
  kube-api-access-2bp5h:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Guaranteed
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age    From               Message
  ----     ------            ----   ----               -------
  Warning  FailedScheduling  3m28s  default-scheduler  0/1 nodes are available: 1 Insufficient memory. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod.
it seems that it is now working with the
2000Mi
y
try reduce your worker to 2 or 1
g
Does it really need that much?
y
let's just make sure it works first
g
Ohh okey okey
Yeap, trying with only one worker now
y
might want to delete all the pods again before you try
g
Yeap, I remembered to do that
It still seems that there is not enough memory; this is the output of the describe:
Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  95s   default-scheduler  0/1 nodes are available: 1 Insufficient memory. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod.
  Normal   Scheduled         90s   default-scheduler  Successfully assigned flytesnacks-development/aq4nm75p6r5ljxmf56bd-n0-3-launcher to docker-desktop
  Normal   Pulled            89s   kubelet            Container image "mpioperator/kubectl-delivery:latest" already present on machine
  Normal   Created           89s   kubelet            Created container kubectl-delivery
  Normal   Started           89s   kubelet            Started container kubectl-delivery
  Normal   Pulled            84s   kubelet            Container image "piloto_mpi:piloto" already present on machine
  Normal   Created           83s   kubelet            Created container mpi
  Normal   Started           83s   kubelet            Started container mpi
I think we should increase the value of the memory?
y
yeah we can try that, or lower the task memory to 1000Mi or something
g
I know how to do the first one, not sure how to do the second one haha
Sorry, I have to run now, but as soon as I can I'll try the first one, and if the second one is here by then that would be great! I hope it gets fixed with this, and if not I hope I can catch you tomorrow. You've been great, giving excellent support about all this, thx!
y
np, you just set
requests=Resources(cpu="1", mem="1000Mi")
good luck
g
Hi everyone! It's me again haha, sorry to bother you another day. Apparently now the pods are starting correctly (there's no longer the error of
Insufficient memory
) since I added the following properties to the task annotation
requests=Resources(cpu="1", mem="1000Mi"),limits=Resources(cpu="2", mem="3000Mi"),
and this is what shows up now when I describe the pod:
kubectl describe pod apgpsl92cr9brztxkqsc-n0-2-launcher -n flytesnacks-development
Name:             apgpsl92cr9brztxkqsc-n0-2-launcher
Namespace:        flytesnacks-development
Priority:         0
Service Account:  apgpsl92cr9brztxkqsc-n0-2-launcher
Node:             docker-desktop/192.168.65.4
Start Time:       Thu, 10 Aug 2023 08:44:49 -0300
Labels:           training.kubeflow.org/job-name=apgpsl92cr9brztxkqsc-n0-2
                  training.kubeflow.org/job-role=master
                  training.kubeflow.org/operator-name=mpijob-controller
                  training.kubeflow.org/replica-type=launcher
Annotations:      <none>
Status:           Failed
IP:               10.1.0.191
IPs:
  IP:           10.1.0.191
Controlled By:  MPIJob/apgpsl92cr9brztxkqsc-n0-2
Init Containers:
  kubectl-delivery:
    Container ID:   docker://de736cc67b7c7e702257d377bcfb69556d638b7ce975360315ae566f3b41fd5c
    Image:          mpioperator/kubectl-delivery:latest
    Image ID:       docker-pullable://mpioperator/kubectl-delivery@sha256:8a4a24114e0bdc8df8f44e657baa6f5d47b24b1664b26c6f59e06575f8f21a55
    Port:           <none>
    Host Port:      <none>
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Thu, 10 Aug 2023 08:44:49 -0300
      Finished:     Thu, 10 Aug 2023 08:44:55 -0300
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:                100m
      ephemeral-storage:  5Gi
      memory:             512Mi
    Requests:
      cpu:                100m
      ephemeral-storage:  5Gi
      memory:             512Mi
    Environment:
      TARGET_DIR:  /opt/kube
      NAMESPACE:   flytesnacks-development
    Mounts:
      /etc/mpi from mpi-job-config (rw)
      /opt/kube from mpi-job-kubectl (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-g99kl (ro)
Containers:
  mpi:
    Container ID:  docker://78e78d32b2927bdf4a04d3bf714877de6bb0c84bcc84668598c776c84f9448d6
    Image:         piloto_mpi:piloto
    Image ID:      docker://sha256:5b9119f28d46ff4859859c2f588b86a5d18e319705c44cdd3a0081e391851433
    Port:          <none>
    Host Port:     <none>
    Args:
      mpirun
      --allow-run-as-root
      -bind-to
      none
      -map-by
      slot
      -x
      LD_LIBRARY_PATH
      -x
      PATH
      -x
      NCCL_DEBUG=INFO
      -mca
      pml
      ob1
      -mca
      btl
      ^openib
      -np
      1
      python
      /opt/venv/bin/entrypoint.py
      pyflyte-execute
      --inputs
      s3://my-s3-bucket/metadata/propeller/flytesnacks-development-apgpsl92cr9brztxkqsc/n0/data/inputs.pb
      --output-prefix
      s3://my-s3-bucket/metadata/propeller/flytesnacks-development-apgpsl92cr9brztxkqsc/n0/data/2
      --raw-output-data-prefix
      s3://my-s3-bucket/0x/apgpsl92cr9brztxkqsc-n0-2
      --checkpoint-path
      s3://my-s3-bucket/0x/apgpsl92cr9brztxkqsc-n0-2/_flytecheckpoints
      --prev-checkpoint
      s3://my-s3-bucket/pw/apgpsl92cr9brztxkqsc-n0-1/_flytecheckpoints
      --resolver
      flytekit.core.python_auto_container.default_task_resolver
      --
      task-module
      workflows.distributed_training
      task-name
      horovod_train_task
    State:      Terminated
      Reason:   Error
      Message:  … 295                 return func(*args, **kwargs)

                /opt/venv/lib/python3.8/site-packages/flytekit/core/python_auto_container.py:235 in load_task

                  235   task_module = importlib.import_module(name=task_module)

                /usr/lib/python3.8/importlib/__init__.py:127 in import_module

                  127   return _bootstrap._gcd_import(name[level:], package, level)
                in _gcd_import:1014
                in _find_and_load:991
                in _find_and_load_unlocked:973
ModuleNotFoundError: No module named 'workflows.distributed_training'
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[54850,1],0]
  Exit code:    1
--------------------------------------------------------------------------

      Exit Code:    1
      Started:      Thu, 10 Aug 2023 08:44:56 -0300
      Finished:     Thu, 10 Aug 2023 08:45:00 -0300
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     2
      memory:  3000Mi
    Requests:
      cpu:     1
      memory:  1000Mi
    Environment:
      FLYTE_INTERNAL_EXECUTION_WORKFLOW:  flytesnacks:development:workflows.distributed_training.horovod_training_wf
      FLYTE_INTERNAL_EXECUTION_ID:        apgpsl92cr9brztxkqsc
      FLYTE_INTERNAL_EXECUTION_PROJECT:   flytesnacks
      FLYTE_INTERNAL_EXECUTION_DOMAIN:    development
      FLYTE_ATTEMPT_NUMBER:               2
      FLYTE_INTERNAL_TASK_PROJECT:        flytesnacks
      FLYTE_INTERNAL_TASK_DOMAIN:         development
      FLYTE_INTERNAL_TASK_NAME:           workflows.distributed_training.horovod_train_task
      FLYTE_INTERNAL_TASK_VERSION:        6
      FLYTE_INTERNAL_PROJECT:             flytesnacks
      FLYTE_INTERNAL_DOMAIN:              development
      FLYTE_INTERNAL_NAME:                workflows.distributed_training.horovod_train_task
      FLYTE_INTERNAL_VERSION:             6
      FLYTE_AWS_ENDPOINT:                 http://minio.flyte.svc.cluster.local:9000
      FLYTE_AWS_ACCESS_KEY_ID:            minio
      FLYTE_AWS_SECRET_ACCESS_KEY:        miniostorage
      OMPI_MCA_plm_rsh_agent:             /etc/mpi/kubexec.sh
      OMPI_MCA_orte_default_hostfile:     /etc/mpi/hostfile
      NVIDIA_VISIBLE_DEVICES:             
      NVIDIA_DRIVER_CAPABILITIES:         
    Mounts:
      /etc/mpi from mpi-job-config (rw)
      /opt/kube from mpi-job-kubectl (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-g99kl (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  mpi-job-kubectl:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  mpi-job-config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      apgpsl92cr9brztxkqsc-n0-2-config
    Optional:  false
  kube-api-access-g99kl:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  91s   default-scheduler  Successfully assigned flytesnacks-development/apgpsl92cr9brztxkqsc-n0-2-launcher to docker-desktop
  Normal  Pulled     91s   kubelet            Container image "mpioperator/kubectl-delivery:latest" already present on machine
  Normal  Created    91s   kubelet            Created container kubectl-delivery
  Normal  Started    91s   kubelet            Started container kubectl-delivery
  Normal  Pulled     84s   kubelet            Container image "piloto_mpi:piloto" already present on machine
  Normal  Created    84s   kubelet            Created container mpi
  Normal  Started    84s   kubelet            Started container mpi
I believe that this might be the more relevant part
ModuleNotFoundError: No module named 'workflows.distributed_training'
btw, my project structure is as follows:
piloto_mpi
├── helm
├── workflows
│   ├── distributed_training.py
│   ├── example.py
│   ├── logistic_regression_wine.py
├── Dockerfile
├── docker_build.sh
├── flyte-package.tgz
└── requirements.txt
As always, any help is greatly appreciated 😄
And these are the two commands that I execute to register the workflows to Flyte:
pyflyte --pkgs workflows package --image piloto_mpi:piloto
flytectl register files --project flytesnacks --domain development --archive flyte-package.tgz --version 6
Hmmm ok, it seems to be working now; apparently I had to re-build the image! 🎉 Thx everyone for the help!
But there's still a question that remains in my head.. why did I need to re-build the Docker image that I wanted to use? I mean, does that mean that I should re-build the image every time I make a change to the workflow code? Or only when I make changes to the properties passed to the
@task
annotation?
s
Since you're registering (but not fast-registering) your code, you need to build a Docker image every time you change your code. Can you try fast-registering your code? You can use this command:
pyflyte register --image <your-image> workflows
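(The earlier ModuleNotFoundError makes sense in this light: at run time the default task resolver imports the task's module by its dotted path inside the container, so the module has to exist in the image, or be shipped separately by fast registration. A minimal local reproduction of that import step; the module name is the one from the traceback above:)

```python
import importlib

# The default task resolver effectively calls importlib.import_module on
# the dotted module path recorded at registration time. If the container
# image doesn't contain that module, the launcher fails exactly this way.
try:
    importlib.import_module("workflows.distributed_training")
except ModuleNotFoundError as exc:
    print(f"ModuleNotFoundError: {exc}")
```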
g
Ohh okey okey, I'll try with that one and see how it goes! Thx again for all the help! 🙌