# ask-the-community
b
Hi All. I am running a distributed training job using Flyte + Kubeflow training operator, as suggested by the documentation. I just added a sidecar to my pod spec. A sidecar is an init container with restartPolicy=Always, which makes it special: the main container does not wait for the sidecar to complete, and the pod's lifetime is tied only to the main container, not the sidecar. However, when I run multi-node training with the training operator, the restartPolicy=Always for the sidecar is removed, which turns it into a regular init container, and then the pod never starts because this particular sidecar is designed to run forever. I am trying to find out where the restart policy is dropped. Does this happen within the kfpytorch plugin, or inside the training operator? How can I find out how my pod spec is being converted into a PyTorchJob?
When I print the PyTorchJob using kubeflow Python API, I don't see the restart policy, so I'm suspecting that it is a Flyte issue.
k
It could be, or the CRD may not support it. Here are the plugins: https://github.com/flyteorg/flyte/tree/master/flyteplugins/go/tasks/plugins/k8s/kfoperators There is a body of work to add pod templates to all these plugins, but that has not been prioritized.
y
if you `kubectl get -o yaml pytorchjobs`, is there enough information there to understand what’s happening?
(i would check myself but i don’t have the crd installed)
b
yeah, it is missing there
I already checked that. So it is not propagated to the spec.
y
i believe this is the code doing the translation… https://github.com/flyteorg/flyte/blob/master/flyteplugins/go/tasks/plugins/k8s/kfoperators/pytorch/pytorch.go#L46 do you see anything obviously missing?
b
@Yee I think I found the problem
In here https://github.com/flyteorg/flyteplugins/blob/master/go/tasks/pluginmachinery/flytek8s/pod_helper.go we do: utils.UnmarshalStructToObj(target.K8SPod.PodSpec, &podSpec)
Here, podSpec is of type v1.PodSpec
v1.PodSpec comes from package k8s.io/api v0.24.1
But 0.24.1 is way too old and does not include restartPolicy for containers (this was added more recently as part of sidecar support in later versions of K8s)
Proof: 0.24.1 does not have restartPolicy for container: https://pkg.go.dev/k8s.io/api@v0.24.1/core/v1#Container
So we need to just bump up the version of k8s.io/api
Are these plugins separately updatable?
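The mechanism suspected in this message, a field being silently dropped when a pod spec is decoded into an older typed struct, can be illustrated with a small Python analogy. This is only a sketch: `ContainerOld`, `ContainerNew`, and `decode` are hypothetical stand-ins for the Go `v1.Container` type at two versions of k8s.io/api and for a typed unmarshal step. They are not Flyte or Kubernetes APIs.

```python
from dataclasses import asdict, dataclass, fields
from typing import Optional


@dataclass
class ContainerOld:
    """Hypothetical stand-in for a Container type that predates sidecar support."""
    name: str
    image: str


@dataclass
class ContainerNew:
    """Hypothetical stand-in for a Container type that declares restartPolicy."""
    name: str
    image: str
    restartPolicy: Optional[str] = None


def decode(cls, raw: dict):
    """Build a typed object from a dict, ignoring keys the type does not declare."""
    known = {f.name for f in fields(cls)}
    return cls(**{k: v for k, v in raw.items() if k in known})


raw = {"name": "sidecar", "image": "example/image", "restartPolicy": "Always"}

# Round-trip the same input through both types and compare what survives.
old_roundtrip = asdict(decode(ContainerOld, raw))
new_roundtrip = asdict(decode(ContainerNew, raw))

print("restartPolicy" in old_roundtrip)  # False: the older type drops the field
print("restartPolicy" in new_roundtrip)  # True: the newer type keeps it
```

Any component in the chain that decodes the spec into the older type would lose the field this way, without raising an error.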
y
interesting. new feature, didn’t know about this.
plugins all get compiled together.
(these backend plugins do anyways, which is one of the reasons why we create the agent style of plugin)
let us get back to you… it shouldn’t be a problem, just need to run a myriad of tests.
mind cutting an issue for this? and thank you very much for the investigation
b
e
@Buğra Gedik, what version of Flyte are you running?
b
In the monorepo, the version is actually k8s.io/api v0.28.4
And has the field.
Let me check
@Eduardo Apolinario (eapolinario) v1.12.0
e
very interesting. So you should be running k8s.io/api v0.28.2 (due to the rewrite), which means that the `RestartPolicy` field should be preserved. Let me take another look at the code.
b
Yeah, looks like the theory does not hold.
e
how are you adding the sidecar to the pod spec?
y
just to confirm, you are seeing the init container right? just not the restart policy?
b
In the pytorchjob, I can see my init container, but its restartPolicy is missing.
I am using @task's pod_template option
And I can see all my customizations, except restartPolicy, which gets removed. When I do kubectl get -o yaml pytorchjobs I see all my customizations except the restartPolicy.
From `kubectl get -o yaml pytorchjobs`:
initContainers:
  - args:
      - |
        <REDACTED>
    command:
      - /bin/bash
      - -c
    env:
      - name: LD_LIBRARY_PATH
        value: /usr/local/nvidia/lib64/
    image: <REDACTED>
    imagePullPolicy: Always
    name: <REDACTED>
    resources: {}
    securityContext:
      privileged: true
    volumeMounts:
      - mountPath: /usr/local/nvidia
        name: nvidia-install-dir-host
^_ restartPolicy is missing.
I am sure I am setting it right, because if I use single-node training, where a PyTorchJob is not created, I see it in the pod spec.
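The manual check described here (looking through the PyTorchJob for init containers that lost their restartPolicy) could be scripted roughly as follows. This is a sketch, not part of Flyte or the training operator: the function name and the `example` manifest are hypothetical, and the only structure assumed is the `spec.pytorchReplicaSpecs.<replica>.template.spec.initContainers` path seen in this thread.

```python
def init_containers_missing_restart_policy(job: dict) -> list:
    """Return (replica_type, container_name) pairs for init containers
    that carry no restartPolicy in a PyTorchJob manifest."""
    missing = []
    replica_specs = job.get("spec", {}).get("pytorchReplicaSpecs", {})
    for replica_type, replica in replica_specs.items():
        pod_spec = replica.get("template", {}).get("spec", {})
        for container in pod_spec.get("initContainers", []):
            if "restartPolicy" not in container:
                missing.append((replica_type, container.get("name")))
    return missing


# Hypothetical manifest: the Master sidecar lost its restartPolicy,
# while the Worker sidecar still has it.
example = {
    "spec": {
        "pytorchReplicaSpecs": {
            "Master": {
                "template": {"spec": {"initContainers": [{"name": "sidecar"}]}}
            },
            "Worker": {
                "template": {
                    "spec": {
                        "initContainers": [
                            {"name": "sidecar", "restartPolicy": "Always"}
                        ]
                    }
                }
            },
        }
    }
}

missing = init_containers_missing_restart_policy(example)
print(missing)  # [('Master', 'sidecar')]
```

For a real job, the dict could come from parsing the JSON form of a single PyTorchJob object (for example, the output of `kubectl get` with `-o json`) using `json.load`.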
e
@Buğra Gedik, I finally got to debug this. I have evidence that Flyte submits a `PyTorchJob` to the kubeflow plugin that contains the correct value of `RestartPolicy` here.
Just to add to ^, when we use the kubeclient to submit the `PyTorchJob` we reach this line and if we expand the list of warnings (which is a field of `result`), we see this one:
unknown field "spec.pytorchReplicaSpecs.Master.template.spec.initContainers[0].restartPolicy"
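If many such warnings come back, the dropped field paths can be pulled out mechanically. A minimal sketch, assuming the warnings are available as plain strings in the format shown in this message; `dropped_fields` is a hypothetical helper, and how the warnings are obtained from the client is not covered here.

```python
import re

# Matches the quoted field path in an 'unknown field "<path>"' warning.
_UNKNOWN_FIELD = re.compile(r'unknown field "([^"]+)"')


def dropped_fields(warnings):
    """Collect the field paths reported as unknown in a list of warning strings."""
    paths = []
    for warning in warnings:
        match = _UNKNOWN_FIELD.search(warning)
        if match:
            paths.append(match.group(1))
    return paths


warnings = [
    'unknown field "spec.pytorchReplicaSpecs.Master.template.spec.initContainers[0].restartPolicy"',
]
paths = dropped_fields(warnings)
print(paths)
```

Each path names a field that the API server did not recognize in the installed CRD schema, which points to the CRD version rather than to the client that submitted the object.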
b
Ok, then this is an issue with training operator
e
@Buğra Gedik, I just tried the `1.8.0-rc.0` version of the operator and I can see the field set there.
b
The latest stable version they have is using k8s.io/api/core/v1 0.24.1, which explains it. Looks like there will be a release sometime soon.
I started using the `1.8.0-rc.0` version and it has been working well so far.