Hi All I am running a distributed training job using Flyte + Kubeflow #flyte-support

fierce-oil-47448

06/12/2024, 8:33 AM

Hi All. I am running a distributed training job using Flyte + the Kubeflow training operator, as suggested by the documentation. I just added a sidecar to my pod spec. A sidecar is an init container with restartPolicy=Always; this makes it special in that the main container does not wait for the sidecar to complete, and the pod's lifetime is tied only to the main container, not the sidecar. However, when I run multi-node training with the training operator, the sidecar's restartPolicy=Always is removed. That turns it into a regular init container, and the pod never starts because this particular sidecar is designed to run forever. I am trying to find where the restart policy is dropped. Does this happen within the kfpytorch plugin or inside the training operator? How can I find out how my pod spec is converted into a PyTorchJob?
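For reference, the sidecar shape described above can be sketched as a plain Python dict that mirrors the pod spec fields. This is only an illustration: the container names and images are hypothetical placeholders, and the helper functions are not part of Flyte or Kubernetes. Only `initContainers` and `restartPolicy` come from the message itself.

```python
# Minimal sketch: a pod spec as a plain dict. Names and images are
# hypothetical placeholders, not values from the original job.
pod_spec = {
    "initContainers": [
        {
            "name": "driver-sidecar",            # hypothetical
            "image": "example/sidecar:latest",   # hypothetical
            "restartPolicy": "Always",           # this is what marks it as a sidecar
        },
        {
            "name": "one-shot-setup",            # hypothetical regular init container
            "image": "example/setup:latest",
        },
    ],
    "containers": [
        {"name": "trainer", "image": "example/trainer:latest"},
    ],
}


def is_sidecar(init_container: dict) -> bool:
    """An init container counts as a sidecar only if restartPolicy is Always."""
    return init_container.get("restartPolicy") == "Always"


def blocking_init_containers(spec: dict) -> list:
    """Names of init containers that must finish before the main containers start."""
    return [c["name"] for c in spec.get("initContainers", []) if not is_sidecar(c)]


def drop_restart_policy(spec: dict) -> dict:
    """Simulate the symptom from this thread: restartPolicy vanishes from init containers."""
    stripped = dict(spec)
    stripped["initContainers"] = [
        {k: v for k, v in c.items() if k != "restartPolicy"}
        for c in spec.get("initContainers", [])
    ]
    return stripped
```

Once the field is dropped, `blocking_init_containers` also lists the former sidecar, which is why a container designed to run forever keeps the pod from ever starting.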

fierce-oil-47448

06/12/2024, 9:12 AM

When I print the PyTorchJob using kubeflow Python API, I don't see the restart policy, so I'm suspecting that it is a Flyte issue.

freezing-airport-6809

06/12/2024, 2:08 PM

It could be, or the CRD may not support it. Here are the plugins: https://github.com/flyteorg/flyte/tree/master/flyteplugins/go/tasks/plugins/k8s/kfoperators There is a body of work to add pod templates to all these plugins. That has not been prioritized.

thankful-minister-83577

06/12/2024, 4:43 PM

if you run `kubectl get -o yaml pytorchjobs`, is there enough information there to understand what’s happening?

thankful-minister-83577

06/12/2024, 4:43 PM

(i would check myself but i don’t have the crd installed)

fierce-oil-47448

06/12/2024, 5:24 PM

yeah, it is missing there

fierce-oil-47448

06/12/2024, 5:30 PM

I already checked that. So it is not propagated to the spec.

thankful-minister-83577

06/12/2024, 5:42 PM

i believe this is the code doing the translation… https://github.com/flyteorg/flyte/blob/master/flyteplugins/go/tasks/plugins/k8s/kfoperators/pytorch/pytorch.go#L46 do you see anything obviously missing?

fierce-oil-47448

06/12/2024, 5:49 PM

@thankful-minister-83577 I think I found the problem

fierce-oil-47448

06/12/2024, 5:49 PM

In here https://github.com/flyteorg/flyteplugins/blob/master/go/tasks/pluginmachinery/flytek8s/pod_helper.go we do: utils.UnmarshalStructToObj(target.K8SPod.PodSpec, &podSpec)

fierce-oil-47448

06/12/2024, 5:49 PM

Here, podSpec is of type v1.PodSpec

fierce-oil-47448

06/12/2024, 5:50 PM

v1.PodSpec comes from package k8s.io/api v0.24.1

fierce-oil-47448

06/12/2024, 5:50 PM

But 0.24.1 is way too old and does not include restartPolicy for containers (this was added more recently as part of sidecar support in later versions of Kubernetes)

fierce-oil-47448

06/12/2024, 5:51 PM

Proof: 0.24.1 does not have restartPolicy for containers: https://pkg.go.dev/k8s.io/api@v0.24.1/core/v1#Container

fierce-oil-47448

06/12/2024, 5:51 PM

Latest has it: https://pkg.go.dev/k8s.io/api@v0.30.2/core/v1#Container
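The mechanism being proposed here is that decoding into a typed struct drops any field the struct does not declare (Go's encoding/json ignores unknown keys by default). A rough Python analogy is sketched below. The two dataclasses are hypothetical stand-ins, not the real k8s.io/api types, and `decode` is an illustrative helper rather than anything Flyte calls.

```python
from dataclasses import asdict, dataclass, fields
from typing import Optional


@dataclass
class OldContainer:
    """Stand-in for a Container type that predates restartPolicy (hypothetical)."""
    name: str
    image: str


@dataclass
class NewContainer:
    """Stand-in for a Container type that declares restartPolicy (hypothetical)."""
    name: str
    image: str
    restartPolicy: Optional[str] = None


def decode(cls, raw: dict):
    """Keep only the keys the target type declares, silently ignoring the rest."""
    known = {f.name for f in fields(cls)}
    return cls(**{k: v for k, v in raw.items() if k in known})


raw = {"name": "sidecar", "image": "example/sidecar:latest", "restartPolicy": "Always"}

# Round-trip the same input through both types and compare what survives.
old_round_trip = asdict(decode(OldContainer, raw))
new_round_trip = asdict(decode(NewContainer, raw))
```

The older type loses `restartPolicy` on the round trip without raising any error, which is why this kind of drop is easy to miss.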

fierce-oil-47448

06/12/2024, 5:57 PM

So we need to just bump up the version of k8s.io/api

fierce-oil-47448

06/12/2024, 5:57 PM

Are these plugins separately updatable?

thankful-minister-83577

06/12/2024, 6:07 PM

interesting. new feature, didn’t know about this.

thankful-minister-83577

06/12/2024, 6:09 PM

plugins all get compiled together.

thankful-minister-83577

06/12/2024, 6:09 PM

(these backend plugins do anyways, which is one of the reasons why we create the agent style of plugin)

thankful-minister-83577

06/12/2024, 6:10 PM

let us get back to you… it shouldn’t be a problem, just need to run a myriad of tests.

thankful-minister-83577

06/12/2024, 6:11 PM

mind cutting an issue for this? and thank you very much for the investigation

fierce-oil-47448

06/12/2024, 6:16 PM

Feature was introduced here: https://kubernetes.io/docs/concepts/workloads/pods/sidecar-containers/

high-accountant-32689

06/12/2024, 6:19 PM

@fierce-oil-47448, what version of Flyte are you running?

fierce-oil-47448

06/12/2024, 6:19 PM

In the monorepo, the version is actually k8s.io/api v0.28.4

👍 1

fierce-oil-47448

06/12/2024, 6:19 PM

And has the field.

fierce-oil-47448

06/12/2024, 6:19 PM

Let me check

fierce-oil-47448

06/12/2024, 6:20 PM

@high-accountant-32689 v1.12.0

high-accountant-32689

06/12/2024, 6:23 PM

very interesting. So you should be running k8s.io/api v0.28.2 (due to the rewrite), which means that the `RestartPolicy` field should be preserved. Let me take another look at the code.

fierce-oil-47448

06/12/2024, 6:29 PM

Yeah, looks like the theory does not hold.

high-accountant-32689

06/12/2024, 6:35 PM

how are you adding the sidecar to the pod spec?

thankful-minister-83577

06/12/2024, 6:35 PM

just to confirm, you are seeing the init container right? just not the restart policy?

fierce-oil-47448

06/12/2024, 6:42 PM

In the pytorchjob, I can see my init container, but its restartPolicy is missing.

fierce-oil-47448

06/12/2024, 6:43 PM

I am using @task's pod_template option

fierce-oil-47448

06/12/2024, 6:44 PM

And I can see all my customizations, except restartPolicy, which gets removed. When I do kubectl get -o yaml pytorchjobs I see all my customizations except the restartPolicy.

fierce-oil-47448

06/12/2024, 6:48 PM

From `kubectl get -o yaml pytorchjobs`:

    initContainers:
    - args:
      - |
        <REDACTED>
      command:
      - /bin/bash
      - -c
      env:
      - name: LD_LIBRARY_PATH
        value: /usr/local/nvidia/lib64/
      image: <REDACTED>
      imagePullPolicy: Always
      name: <REDACTED>
      resources: {}
      securityContext:
        privileged: true
      volumeMounts:
      - mountPath: /usr/local/nvidia
        name: nvidia-install-dir-host

^ restartPolicy is missing.
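The same check can be automated against JSON output (`kubectl get -o json pytorchjobs`) instead of reading the YAML by eye. The sketch below assumes the list shape that kubectl returns (`items`) and the `spec.pytorchReplicaSpecs.<type>.template.spec.initContainers` path; the sample job, container names, and images are hypothetical.

```python
import json


def init_containers_without_restart_policy(job_list: dict) -> list:
    """Return (job, replica type, container name) for each init container lacking restartPolicy."""
    findings = []
    for job in job_list.get("items", []):
        job_name = job.get("metadata", {}).get("name", "<unnamed>")
        replica_specs = job.get("spec", {}).get("pytorchReplicaSpecs", {})
        for replica_type, replica in replica_specs.items():
            pod_spec = replica.get("template", {}).get("spec", {})
            for container in pod_spec.get("initContainers", []):
                if "restartPolicy" not in container:
                    findings.append((job_name, replica_type, container.get("name")))
    return findings


# Hypothetical sample in the shape of `kubectl get -o json pytorchjobs`.
sample = json.loads("""
{
  "kind": "List",
  "items": [
    {
      "metadata": {"name": "demo-job"},
      "spec": {
        "pytorchReplicaSpecs": {
          "Master": {"template": {"spec": {"initContainers": [
            {"name": "nvidia-sidecar", "image": "example/sidecar:latest"}
          ]}}},
          "Worker": {"template": {"spec": {"initContainers": [
            {"name": "nvidia-sidecar", "image": "example/sidecar:latest",
             "restartPolicy": "Always"}
          ]}}}
        }
      }
    }
  ]
}
""")

findings = init_containers_without_restart_policy(sample)
```

Note that a regular init container also has no restartPolicy, so the findings only flag candidates; they are a problem only for containers that were meant to be sidecars.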

fierce-oil-47448

06/12/2024, 6:56 PM

I am sure I am setting it right, because if I use single-node training, where a PyTorchJob is not created, I see it in the pod spec.

high-accountant-32689

06/13/2024, 3:49 AM

@fierce-oil-47448, I finally got to debug this. I have evidence that Flyte submits a `PyTorchJob` to the kubeflow plugin that contains the correct value of `RestartPolicy` here.

high-accountant-32689

06/13/2024, 4:00 AM

Just to add to ^, when we use the kubeclient to submit the `PyTorchJob` we reach this line, and if we expand the list of warnings (which is a field of `result`), we see this one:

unknown field "spec.pytorchReplicaSpecs.Master.template.spec.initContainers[0].restartPolicy"
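One plausible reading of this warning is that the PyTorchJob CRD schema installed in the cluster does not declare the field, so the API server drops it and reports the path. The sketch below is a heavily simplified model of that pruning, not the API server's actual implementation. The `prune` function and the tiny schema are hypothetical and far smaller than the real CRD.

```python
def prune(obj, schema, path=""):
    """Drop keys the schema does not declare; return (pruned object, warnings)."""
    warnings = []
    if isinstance(obj, dict):
        props = schema.get("properties", {})
        extra = schema.get("additionalProperties")  # schema for map-style values
        pruned = {}
        for key, value in obj.items():
            child_path = f"{path}.{key}" if path else key
            if key in props:
                child_schema = props[key]
            elif extra is not None:
                child_schema = extra
            else:
                warnings.append(f'unknown field "{child_path}"')
                continue
            pruned[key], child_warnings = prune(value, child_schema, child_path)
            warnings.extend(child_warnings)
        return pruned, warnings
    if isinstance(obj, list):
        item_schema = schema.get("items", {})
        pruned_items = []
        for index, item in enumerate(obj):
            pruned_item, child_warnings = prune(item, item_schema, f"{path}[{index}]")
            pruned_items.append(pruned_item)
            warnings.extend(child_warnings)
        return pruned_items, warnings
    return obj, warnings  # scalars pass through unchanged


# Hypothetical container schema that predates restartPolicy.
container_schema = {"properties": {"name": {}, "image": {}}}
schema = {"properties": {"spec": {"properties": {
    "pytorchReplicaSpecs": {"additionalProperties": {"properties": {
        "template": {"properties": {"spec": {"properties": {
            "initContainers": {"items": container_schema},
        }}}},
    }}},
}}}}

job = {"spec": {"pytorchReplicaSpecs": {"Master": {"template": {"spec": {
    "initContainers": [
        {"name": "sidecar", "image": "example/sidecar:latest", "restartPolicy": "Always"},
    ],
}}}}}}

pruned, warnings = prune(job, schema)
```

Under this model the object is still accepted and stored, just without `restartPolicy`, which matches a job that is created successfully but whose sidecar behaves like a regular init container.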

fierce-oil-47448

06/13/2024, 4:11 AM

Ok, then this is an issue with the training operator

high-accountant-32689

06/13/2024, 3:32 PM

@fierce-oil-47448, I just tried the `1.8.0-rc.0` version of the operator and I can see the field set there.

fierce-oil-47448

06/14/2024, 7:40 AM

The latest stable version they have is using k8s.io/api/core/v1 0.24.1, which explains it. Looks like there will be a release sometime soon.

👍 2

fierce-oil-47448

06/16/2024, 3:55 AM

I started using the `1.8.0-rc.0` version and it has been working well so far.

👍 1
