#ask-the-community

Tarmily Wen

11/18/2022, 11:43 PM
I followed the guide for installing the MPI Operator and tried running the example MPI workflows, but I ran into this error.

Kevin Su

11/19/2022, 12:05 AM
Did you update the plugin config?
configmap:
  enabled_plugins:
    # -- Task specific configuration [structure](https://pkg.go.dev/github.com/flyteorg/flytepropeller/pkg/controller/nodes/task/config#GetConfig)
    tasks:
      # -- Plugins configuration, [structure](https://pkg.go.dev/github.com/flyteorg/flytepropeller/pkg/controller/nodes/task/config#TaskPluginConfig)
      task-plugins:
        # -- [Enabled Plugins](https://pkg.go.dev/github.com/flyteorg/flyteplugins/go/tasks/config#Config). Enable SageMaker*, Athena if you install the backend
        # plugins
        enabled-plugins:
          - container
          - sidecar
          - k8s-array
          - mpi
        default-for-task-types:
          container: container
          sidecar: sidecar
          container_array: k8s-array
          mpi: mpi
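A quick way to double-check this config is sketched below. It is only an illustration: `check_plugin` and `VALUES` are hypothetical names, and `VALUES` is the snippet above written out as the dict a YAML parser would produce.

```python
def check_plugin(values: dict, plugin: str, task_type: str) -> list[str]:
    """Return a list of problems; an empty list means the plugin looks enabled."""
    task_plugins = (
        values.get("configmap", {})
        .get("enabled_plugins", {})
        .get("tasks", {})
        .get("task-plugins", {})
    )
    problems = []
    # The plugin must be listed under enabled-plugins ...
    if plugin not in task_plugins.get("enabled-plugins", []):
        problems.append(f"{plugin!r} missing from enabled-plugins")
    # ... and the task type must be routed to it under default-for-task-types.
    if task_plugins.get("default-for-task-types", {}).get(task_type) != plugin:
        problems.append(f"task type {task_type!r} does not default to {plugin!r}")
    return problems


# Assumption: this mirrors the Helm values shown in the message above.
VALUES = {
    "configmap": {
        "enabled_plugins": {
            "tasks": {
                "task-plugins": {
                    "enabled-plugins": ["container", "sidecar", "k8s-array", "mpi"],
                    "default-for-task-types": {
                        "container": "container",
                        "sidecar": "sidecar",
                        "container_array": "k8s-array",
                        "mpi": "mpi",
                    },
                }
            }
        }
    }
}
```

Calling `check_plugin(VALUES, "mpi", "mpi")` returns an empty list for the config above. A plugin that is absent from both sections would produce two problem strings.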

Tarmily Wen

11/19/2022, 12:05 AM
Yes, and I confirmed on GKE that flyte-propeller-config reflects that change.

Ketan (kumare3)

11/19/2022, 12:07 AM
We will have to look under the hood.
It seems the MPI operator is not behaving correctly.

Samhita Alla

11/21/2022, 3:57 AM
cc: @Yuvraj

Yuvraj

11/21/2022, 4:44 AM
@Tarmily Wen Which example are you using?

Tarmily Wen

11/21/2022, 5:35 PM
https://github.com/flyteorg/flytesnacks/releases/download/v0.3.112/snacks-cookbook-integrations-kubernetes-kfmpi.tar.gz

Yuvraj

11/23/2022, 2:15 PM
@Tarmily Wen Sorry for the late reply. I will try the MPI operator today and let you know.

Ketan (kumare3)

11/23/2022, 3:38 PM
Thank you @Yuvraj, and sorry @Tarmily Wen for the delay on our side.
@Tarmily Wen Until then, can you look at the Kubernetes cluster and check for the CRDs?
It seems the operator is not doing the right thing.

Tarmily Wen

11/23/2022, 3:51 PM
I can see the CRD for mpijobs:
kubectl -n flyte get crd
NAME                                             CREATED AT
backendconfigs.cloud.google.com                  2022-11-15T00:00:37Z
capacityrequests.internal.autoscaling.gke.io     2022-11-15T00:00:07Z
flyteworkflows.flyte.lyft.com                    2022-11-15T02:05:53Z
frontendconfigs.networking.gke.io                2022-11-15T00:00:38Z
managedcertificates.networking.gke.io            2022-11-15T00:00:24Z
memberships.hub.gke.io                           2022-11-15T00:03:55Z
mpijobs.kubeflow.org                             2022-11-18T17:42:49Z
serviceattachments.networking.gke.io             2022-11-15T00:00:41Z
servicenetworkendpointgroups.networking.gke.io   2022-11-15T00:00:39Z
updateinfos.nodemanagement.gke.io                2022-11-15T00:00:40Z
volumesnapshotclasses.snapshot.storage.k8s.io    2022-11-15T00:00:38Z
volumesnapshotcontents.snapshot.storage.k8s.io   2022-11-15T00:00:39Z
volumesnapshots.snapshot.storage.k8s.io          2022-11-15T00:00:39Z

Ketan (kumare3)

11/23/2022, 3:52 PM
Let’s list all objects under the CRD, find the object for your execution, and read it with -o yaml.
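That inspection step could look roughly like the sketch below, assuming `kubectl` is already configured for the cluster. The helper names are hypothetical, and the thread does not say how the MPIJob object for an execution is named, so the object name has to be copied from the listing.

```python
import subprocess

CRD = "mpijobs.kubeflow.org"  # CRD name as it appears in the `kubectl get crd` output


def list_cmd(namespace: str) -> list[str]:
    """Command that lists every MPIJob object in a namespace."""
    return ["kubectl", "get", CRD, "-n", namespace]


def dump_cmd(namespace: str, name: str) -> list[str]:
    """Command that prints one MPIJob object as YAML."""
    return ["kubectl", "get", CRD, name, "-n", namespace, "-o", "yaml"]


def run(cmd: list[str]) -> str:
    """Run a kubectl command and return its stdout (needs cluster access)."""
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout
```

For example, `run(list_cmd("flytesnacks-development"))` lists the MPIJobs in that namespace (an assumption about where the execution runs), and `run(dump_cmd(namespace, name))` dumps the one that belongs to the execution.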

Tarmily Wen

11/23/2022, 3:55 PM

Yuvraj

11/24/2022, 1:52 PM
I tried the example and it is not working. I am getting this error from propeller:
{
  "json": {
    "exec_id": "aggfj525qt8stl9lnqk9",
    "ns": "flytesnacks-development",
    "res_ver": "4206",
    "routine": "worker-1",
    "wf": "flytesnacks:development:kfmpi.mpi_mnist.horovod_training_wf"
  },
  "level": "error",
  "msg": "Error when trying to reconcile workflow. Error [failed at Node[n0]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [mpi]: found no current condition. Conditions: []]. Error Type[*errors.NodeErrorWithCause]",
  "ts": "2022-11-24T13:48:59Z"
}
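For anyone triaging similar entries, here is a small sketch that pulls the useful fields out of a propeller log line like the one above. The function name `summarize` and the regular expressions are assumptions based only on the message format seen in this single entry.

```python
import json
import re

# The log entry from the message above, as a single JSON string.
LOG_LINE = (
    '{"json": {"exec_id": "aggfj525qt8stl9lnqk9", "ns": "flytesnacks-development", '
    '"res_ver": "4206", "routine": "worker-1", '
    '"wf": "flytesnacks:development:kfmpi.mpi_mnist.horovod_training_wf"}, '
    '"level": "error", '
    '"msg": "Error when trying to reconcile workflow. Error [failed at Node[n0]. '
    'RuntimeExecutionError: failed during plugin execution, caused by: failed to '
    'execute handle for plugin [mpi]: found no current condition. Conditions: []]. '
    'Error Type[*errors.NodeErrorWithCause]", '
    '"ts": "2022-11-24T13:48:59Z"}'
)


def summarize(line: str) -> dict:
    """Extract execution ID, namespace, failing node, and plugin from a log line."""
    entry = json.loads(line)
    msg = entry["msg"]
    node = re.search(r"failed at Node\[([^\]]+)\]", msg)
    plugin = re.search(r"for plugin \[([^\]]+)\]", msg)
    return {
        "exec_id": entry["json"]["exec_id"],
        "namespace": entry["json"]["ns"],
        "node": node.group(1) if node else None,
        "plugin": plugin.group(1) if plugin else None,
    }
```

The execution ID and namespace it returns are the values needed to locate the matching MPIJob object in the cluster.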
Propeller config:
enabled_plugins.yaml: |
    tasks:
      task-plugins:
        default-for-task-types:
          container: container
          container_array: k8s-array
          mpi: mpi
          sidecar: sidecar
        enabled-plugins:
        - container
        - sidecar
        - k8s-array
        - mpi
@Dan Rammer (hamersaw) Any idea what the issue could be?

Ketan (kumare3)

11/24/2022, 3:32 PM
I think something has changed in the operator. An older version might work better.

Yuvraj

11/25/2022, 2:21 PM
We define the plugin name in flytepropeller (K8s plugin), and I think it’s a propeller error: propeller is not able to handle the mpi plugin.

Dan Rammer (hamersaw)

12/01/2022, 3:50 PM
@Yuvraj this error is coming from the MPI plugin - not necessarily propeller. It seems to be coming from this line. I'm not sure this is a quick fix, but if you can repro maybe we should file a bug on this and make it a priority to fix.
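To reproduce the symptom outside Flyte, one option is to dump the MPIJob as JSON and check its `status.conditions` list, which the error message reports as empty. The sketch below is not the plugin's code. It is an illustrative check that assumes conditions carry `type`, `status`, and `lastTransitionTime` fields and that the "current" condition is the most recent one whose status is `"True"`.

```python
def current_condition(mpijob: dict) -> dict:
    """Return the most recent condition with status "True", or raise if none exists."""
    conditions = mpijob.get("status", {}).get("conditions") or []
    active = [c for c in conditions if c.get("status") == "True"]
    if not active:
        # Mirrors the wording of the error seen in the propeller log.
        raise ValueError(f"found no current condition. Conditions: {conditions}")
    # Timestamps in the same ISO 8601 UTC format sort correctly as strings.
    return max(active, key=lambda c: c.get("lastTransitionTime", ""))
```

If the operator never writes any conditions to the object, this check fails in the same way as the log entry, which would point at the operator or its version rather than at the propeller config.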