# ask-the-community
r
Hi Community! We want to use the MPI plugin to launch distributed learning tasks on GCP. That plugin relies on the MPI operator from Kubeflow, which spawns launcher and worker pods. The launcher pod then needs to access the Flyte GCS bucket. Since we grant access to the Flyte GCS bucket via K8s service accounts, the launcher pod cannot reach the bucket: we did not find a way to tell the MPI operator to use a specific K8s service account for the launcher and worker pods. Do you have a solution for this? It is quite a blocker for us at the moment. Thanks for your advice!
k
Is this tensorflow or pytorch
Also let me take a look, cc @Kevin Su ?
f
You could bind the respective GCP service account via workload identities to the default service account used by the launcher and worker as a workaround maybe? 🤔
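Roughly along these lines, just a sketch (assuming GKE Workload Identity; project, namespace, and service-account names are placeholders):
```bash
# Allow the default KSA in the execution namespace to impersonate the GCP service account
# (all names are placeholders – adjust project, namespace, and GSA to your setup)
gcloud iam service-accounts add-iam-policy-binding \
  <gsa-name>@<project>.iam.gserviceaccount.com \
  --role roles/iam.workloadIdentityUser \
  --member "serviceAccount:<project>.svc.id.goog[dev/default]"

# Annotate the default KSA so GKE maps it to that GCP service account
kubectl -n dev annotate serviceaccount default \
  iam.gke.io/gcp-service-account=<gsa-name>@<project>.iam.gserviceaccount.com
```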
k
@Rob Ulbrich what service account are you passing in pyflyte run
r
The example workflow was tensorflow
Currently, we do not pass the service account directly, but have a service account per workspace that is set in the configuration
k
Hmm is that the default service account?
If you bind it to default it should work? Cc @Fabio Grätz
f
We don’t use MPI but we do use the default service account in every domain namespace and bind a GCP service account to it.
r
`pyflyte run --remote -p workflow -d dev --service-account default mnist.py horovod_training_wf`
f
🤔
r
So I just ran the MPI workflow with the `--service-account` flag, with `default` as the service account to use. But when you look at the service account that the launcher is using, it is a totally different one.
f
Same happens when you don’t specify the service account in the pyflyte command?
r
Yes
f
It appears as if there is a service account created temporarily for each node/execution.
What is the service account of the worker?
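A quick way to see that could be something like this (just a sketch):
```bash
# List every pod in the dev namespace together with the service account it runs as
kubectl -n dev get pods \
  -o custom-columns=NAME:.metadata.name,SERVICEACCOUNT:.spec.serviceAccountName
```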
k
We will have to look at the training operator 😤
f
The launcher might need a dedicated service account because it creates other pods? (Suggested by the name)
I wonder whether the workers have the default service account then at least.
Could you paste the manifest for the `SparkApplication`?
r
Okay, the workers use the service account that I specify in `pyflyte`.
f
`kubectl -n <namespace name> get serviceaccount default -o yaml`
What is the output of this?
r
```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  annotations:
    iam.gke.io/gcp-service-account: flyte-workflow-dev-sa@<redacted>
  creationTimestamp: "2023-08-03T07:15:09Z"
  name: default
  namespace: dev
  resourceVersion: "676089"
  uid: dd7f67ce-2530-4c59-883d-2e08bebeebbb
```
f
Ok, workload identities are there
r
The `default` service account is the one that we want to use
f
Assuming the required IAM permissions are configured too.
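E.g. something along these lines to double-check (a sketch; service-account and bucket names are placeholders):
```bash
# Who is allowed to impersonate the GCP service account (should include the dev/default KSA)
gcloud iam service-accounts get-iam-policy \
  <gsa-name>@<project>.iam.gserviceaccount.com

# IAM policy on the Flyte bucket (should grant the GSA storage access)
gsutil iam get gs://<flyte-bucket>
```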
r
It has the permissions to access the GCS buckets
Sure
For Pod Tasks everything works fine
f
Ok 👍
I don’t know enough about MPI tbh 😕
But does the launcher really need to access anything in GCP?
Or is its job only to create other pods?
r
Yes, I see error logs in the launcher pod saying that it needs to access Flyte data from the Flyte GCS bucket
f
Probably the code for fast registration 🤔
k
@Fabio Grätz MPI does not need a launcher - it’s a peer to peer protocol
f
Yes but what does the launcher pod do? 🤔
k
Yes. @Rob Ulbrich, does a regular pod task work with pyflyte run? If so, this should work
Beats me for launcher
r
I guess we need the launcher pod, because I am running Flyte's own example code, where it uses the launcher 😉
```python
# NOTE: the original snippet showed only the task decorator; imports and the
# function name/body below are placeholders added for completeness.
from flytekit import Resources, task
from flytekitplugins.kfmpi import Launcher, MPIJob, Worker


@task(
    container_image="europe-west4-docker.pkg.dev/<redacted>/flyte/mpi-mnist:latest",
    task_config=MPIJob(
        launcher=Launcher(
            replicas=1,
        ),
        worker=Worker(
            replicas=2,
        ),
    ),
    retries=3,
    requests=Resources(cpu="1", mem="1000Mi"),
    limits=Resources(cpu="2", mem="4000Mi"),
)
def horovod_training_task() -> None:
    ...
```
I suppose the launcher communicates with the workers, sends data to them, receives the learned parameters, and brings everything together
k
Haha, I think the launcher may be a misnomer. Let us try today
f
Sorry, I’m mixing spark and MPI now 🤦‍♂️
@Dimss consider trying the v2 controller, which doesn’t depend on ServiceAccounts 🙂
Which version of mpi operator or training operator (?) are you running?
r
Seems to be image version v1-855e096
f
But mpi operator or training operator?
r
https://github.com/kubeflow/training-operator, using branch v1.7-branch
Standalone
The training operator actually spawns the launcher and the workers
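For reference, the deployed operator image can be checked roughly like this (assuming the standalone install into the kubeflow namespace):
```bash
# Print the image of the standalone training-operator deployment
kubectl -n kubeflow get deployment training-operator \
  -o jsonpath='{.spec.template.spec.containers[0].image}'
```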
f
Mh ok 😕 The issue above suggested that there are versions that need a service account for the launcher whereas others don’t. But the training operator is definitely newer than the mpi operator. We don’t use mpi, I honestly don’t know the answer, sorry 🤷‍♂️
@Byron Hsu @Yubo Wang you guys use mpi, or am I mistaken?
k
ohh right they do
also maybe @Rahul Mehta?
@Samhita Alla also has run i think
r
Ah, we are actually opting to use Dask instead of the Kubeflow training operator
y
We do, but unfortunately we don’t use K8s service accounts. Our MPI jobs talk to Hadoop only, no cloud storage 😅
f
Do you build the code into the image, i.e. no fast registration?
y
we have our internal blob storage, it uses certs for authentication. And we mount the certs through some internal service
r
I found where the problem is located in the training operator and created an issue in the GitHub repository.
But I actually found another issue: if there is a sidecar container, MPI jobs can't be spawned either, because the launcher connects to the workers using `kubectl exec`. It then connects to the sidecar container instead of the main container where the worker is located.
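For illustration: without `-c`, `kubectl exec` runs in the pod's default container (the first one, or the one named in the `kubectl.kubernetes.io/default-container` annotation), so with a sidecar it may not hit the worker at all. Pod and container names below are placeholders:
```bash
# Lands in the default container – with a sidecar that is not necessarily the MPI worker
kubectl -n dev exec <worker-pod> -- hostname

# Explicitly targeting the worker container would avoid the sidecar
kubectl -n dev exec <worker-pod> -c <worker-container> -- hostname
```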
k
@Rob Ulbrich so what do you want to train, torch?
If so, check out the flytekit plugin for elastic
r
We have a proprietary ML app that needs MPI, so we can't use the flytekit plugin for elastic
I will create an issue for Flyte
k
Please file one with details - we will try to reproduce it
r
I have decided to create another issue with the Kubeflow team for the training operator, because it is easier for them to fix