I followed the <guide> to installing MPI Operator,...
# flyte-support
d
I followed the guide for installing the MPI Operator and tried running the example MPI workflows. I ran into this error.
g
Did you update the plugin config?
configmap:
  enabled_plugins:
    # -- Task specific configuration [structure](https://pkg.go.dev/github.com/flyteorg/flytepropeller/pkg/controller/nodes/task/config#GetConfig)
    tasks:
      # -- Plugins configuration, [structure](https://pkg.go.dev/github.com/flyteorg/flytepropeller/pkg/controller/nodes/task/config#TaskPluginConfig)
      task-plugins:
        # -- [Enabled Plugins](https://pkg.go.dev/github.com/flyteorg/flyteplugins/go/tasks/config#Config). Enable SageMaker*, Athena if you install the backend
        # plugins
        enabled-plugins:
          - container
          - sidecar
          - k8s-array
          - mpi
        default-for-task-types:
          container: container
          sidecar: sidecar
          container_array: k8s-array
          mpi: mpi
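The config above both enables `mpi` and maps the `mpi` task type to it. As a minimal sketch for double-checking those two entries, assuming the `task-plugins` section has already been loaded into a Python dict (the helper name `missing_plugin_settings` is hypothetical and not part of Flyte):

```python
def missing_plugin_settings(task_plugins: dict, plugin: str, task_type: str) -> list[str]:
    """Return a list of human-readable problems; empty means both entries are present."""
    problems = []
    if plugin not in task_plugins.get("enabled-plugins", []):
        problems.append(f"{plugin!r} is not listed under enabled-plugins")
    defaults = task_plugins.get("default-for-task-types", {})
    if defaults.get(task_type) != plugin:
        problems.append(f"task type {task_type!r} is not mapped to {plugin!r}")
    return problems


# Values copied from the config pasted in this thread.
task_plugins = {
    "enabled-plugins": ["container", "sidecar", "k8s-array", "mpi"],
    "default-for-task-types": {
        "container": "container",
        "sidecar": "sidecar",
        "container_array": "k8s-array",
        "mpi": "mpi",
    },
}

print(missing_plugin_settings(task_plugins, "mpi", "mpi"))  # prints []
```

An empty list only says the config text looks right; it does not show that propeller has actually reloaded it.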
d
Yes, and I confirmed on GKE that flyte-propeller-config reflects that change.
f
we will have to look under the hood
seems the mpi operator is not behaving correctly
t
cc: @great-school-54368
g
@delightful-computer-49028 Which example are you using?
d
https://github.com/flyteorg/flytesnacks/releases/download/v0.3.112/snacks-cookbook-integrations-kubernetes-kfmpi.tar.gz
g
@delightful-computer-49028 Sorry for the late reply, I will try the MPI operator today and let you know
f
Thank you @great-school-54368 and sorry @delightful-computer-49028 for the delay on our side
@delightful-computer-49028 until then, can you check the Kubernetes cluster and look for the CRDs?
It seems the operator is not doing the right thing
d
I can see the crd at mpijobs
kubectl -n flyte get crd
NAME                                             CREATED AT
backendconfigs.cloud.google.com                  2022-11-15T00:00:37Z
capacityrequests.internal.autoscaling.gke.io     2022-11-15T00:00:07Z
flyteworkflows.flyte.lyft.com                    2022-11-15T02:05:53Z
frontendconfigs.networking.gke.io                2022-11-15T00:00:38Z
managedcertificates.networking.gke.io            2022-11-15T00:00:24Z
memberships.hub.gke.io                           2022-11-15T00:03:55Z
mpijobs.kubeflow.org                             2022-11-18T17:42:49Z
serviceattachments.networking.gke.io             2022-11-15T00:00:41Z
servicenetworkendpointgroups.networking.gke.io   2022-11-15T00:00:39Z
updateinfos.nodemanagement.gke.io                2022-11-15T00:00:40Z
volumesnapshotclasses.snapshot.storage.k8s.io    2022-11-15T00:00:38Z
volumesnapshotcontents.snapshot.storage.k8s.io   2022-11-15T00:00:39Z
volumesnapshots.snapshot.storage.k8s.io          2022-11-15T00:00:39Z
f
Let’s list all objects under the CRD and find the object for your execution.
Then read it with -o yaml.
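One way to narrow that down, as a sketch: dump the MPIJob objects as JSON with `kubectl get mpijobs -n <namespace> -o json` and filter them by execution ID. This assumes the object name starts with the Flyte execution ID, which is worth confirming against the real listing; `objects_for_execution` and the sample data below are made up for illustration.

```python
import json


def objects_for_execution(listing: dict, exec_id: str) -> list[dict]:
    """Pick the items whose metadata.name starts with the execution ID."""
    return [
        item
        for item in listing.get("items", [])
        if item.get("metadata", {}).get("name", "").startswith(exec_id)
    ]


# Hypothetical, trimmed listing; real kubectl output has many more fields.
sample = json.loads("""
{
  "items": [
    {"metadata": {"name": "abc123-n0-0"}, "status": {}},
    {"metadata": {"name": "zzz999-n0-0"}, "status": {"conditions": []}}
  ]
}
""")

for obj in objects_for_execution(sample, "abc123"):
    # The status block is the interesting part when the operator misbehaves.
    print(obj["metadata"]["name"], obj.get("status"))
```

Once the object name is known, `kubectl get mpijob <name> -n <namespace> -o yaml` shows the full object, including whatever status the operator has written.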
d
crd.yaml
g
I tried the example and it is not working. I am getting this error from propeller:
{
  "json": {
    "exec_id": "aggfj525qt8stl9lnqk9",
    "ns": "flytesnacks-development",
    "res_ver": "4206",
    "routine": "worker-1",
    "wf": "flytesnacks:development:kfmpi.mpi_mnist.horovod_training_wf"
  },
  "level": "error",
  "msg": "Error when trying to reconcile workflow. Error [failed at Node[n0]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [mpi]: found no current condition. Conditions: []]. Error Type[*errors.NodeErrorWithCause]",
  "ts": "2022-11-24T13:48:59Z"
}
Propeller Config
enabled_plugins.yaml: |
    tasks:
      task-plugins:
        default-for-task-types:
          container: container
          container_array: k8s-array
          mpi: mpi
          sidecar: sidecar
        enabled-plugins:
        - container
        - sidecar
        - k8s-array
        - mpi
@hallowed-mouse-14616 Any idea what the issue could be?
f
I think something has changed in the operator. An older version might work better.
g
We define the plugin name in flytepropeller (K8s plugin), and I think it’s a propeller error: propeller is not able to handle the mpi plugin.
h
@great-school-54368 this error is coming from the MPI plugin - not necessarily propeller. It seems to be coming from this line. I'm not sure this is a quick fix, but if you can repro maybe we should file a bug on this and make it a priority to fix.
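To make the error message concrete: "found no current condition. Conditions: []" suggests the plugin looked at the MPIJob's `status.conditions` and found the list empty, meaning the operator had not written any status for the job. A rough Python sketch of that kind of check follows. It is not the plugin's actual Go code; `current_condition` is a hypothetical name, and the condition fields assume the usual Kubernetes `type`/`status` shape.

```python
def current_condition(conditions: list[dict]) -> dict:
    """Return the last condition whose status is "True", or raise if there is none."""
    for cond in reversed(conditions):
        if cond.get("status") == "True":
            return cond
    raise ValueError(f"found no current condition. Conditions: {conditions}")


# A job the operator has updated: the most recent "True" condition wins.
print(current_condition([
    {"type": "Created", "status": "True"},
    {"type": "Running", "status": "True"},
])["type"])

# A job whose status was never filled in reproduces the message from this thread.
try:
    current_condition([])
except ValueError as err:
    print(err)
```

If the real MPIJob shows an empty or missing `status.conditions`, that points at the operator (or a version mismatch between the operator and the plugin) rather than at the propeller config.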