# ask-the-community
r
Hi Community! We want to use the MPI plugin to launch distributed learning tasks on GCP. That plugin relies on the MPI operator from Kubeflow, which spawns launcher and worker pods. The launcher pod then needs to access the Flyte GCS bucket. Since we grant access to the Flyte GCS bucket via K8s service accounts, the launcher pod cannot reach the bucket: we did not find a way to tell the MPI operator to use a specific K8s service account for the launcher and worker pods. Do you have a solution for this? It is quite a blocker for us at the moment. Thanks for your advice!
k
Is this tensorflow or pytorch
Also let me take a look, cc @Kevin Su ?
f
You could bind the respective GCP service account via workload identities to the default service account used by the launcher and worker as a workaround maybe? 🤔
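Roughly along these lines, just a sketch (assuming GKE Workload Identity; project, namespace, and service-account names are placeholders):
```bash
# Allow the default KSA in the execution namespace to impersonate the GCP service account
# (all names are placeholders – adjust project, namespace, and GSA to your setup)
gcloud iam service-accounts add-iam-policy-binding \
  <gsa-name>@<project>.iam.gserviceaccount.com \
  --role roles/iam.workloadIdentityUser \
  --member "serviceAccount:<project>.svc.id.goog[dev/default]"

# Annotate the default KSA so GKE maps it to that GCP service account
kubectl -n dev annotate serviceaccount default \
  iam.gke.io/gcp-service-account=<gsa-name>@<project>.iam.gserviceaccount.com
```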
k
@Rob Ulbrich what service account are you passing in pyflyte run
r
The example workflow was tensorflow
Currently, we do not pass the service account directly, but have a service account per workspace that is set in the configuration
k
Hmm is that the default service account?
If you bind it to default it should work? Cc @Fabio Grätz
f
We don’t use MPI but we do use the default service account in every domain namespace and bind a GCP service account to it.
r
`pyflyte run --remote -p workflow -d dev --service-account default mnist.py horovod_training_wf`
f
🤔
r
So I just ran the MPI workflow with the `--service-account` flag, with `default` as the service account to use. But when you look at the service account that the launcher is using, it is a totally different one.
f
Same happens when you don’t specify the service account in the pyflyte command?
r
Yes
f
It appears as if there is a service account created temporarily for each node/execution.
What is the service account of the worker?
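A quick way to see that could be something like this (just a sketch):
```bash
# List every pod in the dev namespace together with the service account it runs as
kubectl -n dev get pods \
  -o custom-columns=NAME:.metadata.name,SERVICEACCOUNT:.spec.serviceAccountName
```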
k
We will have to look at the training operator 😤
f
The launcher might need a dedicated service account because it creates other pods? (Suggested by the name)
I wonder whether the workers have the default service account then at least.
Could you paste the manifest for the `SparkApplication`?
r
Okay, the workers use the service account that I specify in `pyflyte`.
f
`kubectl -n <namespace name> get serviceaccount default -o yaml`
What is the output of this?
r
```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  annotations:
    iam.gke.io/gcp-service-account: flyte-workflow-dev-sa@<redacted>
  creationTimestamp: "2023-08-03T07:15:09Z"
  name: default
  namespace: dev
  resourceVersion: "676089"
  uid: dd7f67ce-2530-4c59-883d-2e08bebeebbb
```
f
Ok, workload identities are there
r
The `default` service account is the one that we want to use
f
Assuming the required IAM permissions are configured too.
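E.g. something along these lines to double-check (a sketch; service-account and bucket names are placeholders):
```bash
# Who is allowed to impersonate the GCP service account (should include the dev/default KSA)
gcloud iam service-accounts get-iam-policy \
  <gsa-name>@<project>.iam.gserviceaccount.com

# IAM policy on the Flyte bucket (should grant the GSA storage access)
gsutil iam get gs://<flyte-bucket>
```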
r
It has the permissions to access the GCS buckets
Sure
For Pod Tasks everything works fine
f
Ok 👍
I don’t know enough about MPI tbh 😕
But does the launcher really need to access anything in GCP?
Or is its job only to create other pods?
r
Yes, I see error logs in the launcher pod saying that it needs to access Flyte data from the Flyte GCS bucket
f
Probably the code for fast registration 🤔
k
@Fabio Grätz MPI does not need a launcher - it’s a peer to peer protocol
f
Yes but what does the launcher pod do? 🤔
k
Yes. @Rob Ulbrich, does a regular pod task work with pyflyte run? If so, this should work
Beats me for launcher
r
I guess we need the launcher pod, because I am running Flyte's own example code, where it uses the launcher 😉
```python
# NOTE: the original snippet showed only the task decorator; imports and the
# function name/body below are placeholders added for completeness.
from flytekit import Resources, task
from flytekitplugins.kfmpi import Launcher, MPIJob, Worker


@task(
    container_image="europe-west4-docker.pkg.dev/<redacted>/flyte/mpi-mnist:latest",
    task_config=MPIJob(
        launcher=Launcher(
            replicas=1,
        ),
        worker=Worker(
            replicas=2,
        ),
    ),
    retries=3,
    requests=Resources(cpu="1", mem="1000Mi"),
    limits=Resources(cpu="2", mem="4000Mi"),
)
def horovod_training_task() -> None:
    ...
```
I suppose the launcher communicates with the workers, sends data to them, receives the learned parameters, and brings everything together
k
Haha, I think the launcher may be a misnomer. Let us try today
f
Sorry, I’m mixing spark and MPI now 🤦‍♂️
@Dimss consider trying the v2 controller, which doesn’t depend on ServiceAccounts 🙂
Which version of mpi operator or training operator (?) are you running?
r
Seems to be image version v1-855e096
f
But mpi operator or training operator?
r
https://github.com/kubeflow/training-operator, using branch v1.7-branch
Standalone
The training operator actually spawns the launcher and the workers
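For reference, the deployed operator image can be checked roughly like this (assuming the standalone install into the kubeflow namespace):
```bash
# Print the image of the standalone training-operator deployment
kubectl -n kubeflow get deployment training-operator \
  -o jsonpath='{.spec.template.spec.containers[0].image}'
```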
f
Mh ok 😕 The issue above suggested that there are versions that need a service account for the launcher whereas others don’t. But the training operator is definitely newer than the mpi operator. We don’t use mpi, I honestly don’t know the answer, sorry 🤷‍♂️
@Byron Hsu @Yubo Wang you guys use mpi, or am I mistaken?
k
ohh right they do
also maybe @Rahul Mehta?
@Samhita Alla also has run i think
r
Ah, we are actually opting to use Dask instead of the Kubeflow training operator
y
We do, but unfortunately we don’t use K8s service accounts. Our MPI jobs talk to Hadoop only, no cloud storage 😅
f
Do you build the code into the image, i.e. no fast registration?
y
we have our internal blob storage, it uses certs for authentication. And we mount the certs through some internal service
r
I found where the problem is located in the training operator and created an issue in the GitHub repository.
But I actually found another issue: if there is a sidecar container, MPI jobs can't be spawned either, because the launcher connects to the workers using `kubectl exec`. It then connects to the sidecar container instead of the main container where the worker is located.
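For illustration: without `-c`, `kubectl exec` runs in the pod's default container (the first one, or the one named in the `kubectl.kubernetes.io/default-container` annotation), so with a sidecar it may not hit the worker at all. Pod and container names below are placeholders:
```bash
# Lands in the default container – with a sidecar that is not necessarily the MPI worker
kubectl -n dev exec <worker-pod> -- hostname

# Explicitly targeting the worker container would avoid the sidecar
kubectl -n dev exec <worker-pod> -c <worker-container> -- hostname
```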
k
@Rob Ulbrich so what do you want to train, torch?
If so, check out the flytekit plugin for elastic
r
We have a proprietary ML app that needs MPI, so we can't use the flytekit plugin for elastic
I will create an issue for Flyte
k
Please file one with details - we will try to reproduce it
r
I have decided to create another issue with the Kubeflow team for the training operator, because it is easier for them to fix