# flyte-support
f
Hi. I am looking at flytekitplugins/kfpytorch/task.py, and I see that when using the Elastic plugin, a clean_pod_policy is not being set. This results in resources not being cleaned up automatically. The same setting is available for PyTorch tasks. Am I missing something?
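For reference, a non-elastic PyTorch task can set a clean pod policy roughly like this (a sketch assuming a flytekitplugins-kfpytorch release that exposes RunPolicy and CleanPodPolicy; exact parameter names may differ between versions):

from flytekit import task
from flytekitplugins.kfpytorch import CleanPodPolicy, PyTorch, RunPolicy


# Sketch only: assumes this plugin version accepts run_policy on the PyTorch config.
@task(
    task_config=PyTorch(
        num_workers=2,
        run_policy=RunPolicy(clean_pod_policy=CleanPodPolicy.ALL),
    )
)
def train() -> None:
    ...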
f
hmm we have seen it being cleaned up
cc @cool-lifeguard-49380 / @broad-monitor-993
b
Don’t know anything about this, Fabio may have more context
f
When we run elastic we don’t see leaks, right? Maybe this is new
c
We don’t see any, but which version of the training operator do you run?
f
We are using the v1.8.0-rc.0 release
This is because the earlier stable version does not support sidecars
Btw, I think this might be because they changed the defaults
Let me double check that
For PyTorch jobs, it seems like it was always None
@cool-lifeguard-49380 I don't see the clean pod policy being set by the flyte plugin when using Elastic. And in the latest rc version of the training operator, it defaults to None.
I am looking at the latest stable release here: https://github.com/kubeflow/training-operator/blob/v1.7-branch/pkg/apis/kubeflow.org/v1/pytorch_defaults.go The clean pod policy is set to None there as well.
c
Will do some experiments tonight and get back to you 🙂
I installed kubeflow training operator 1.8.0-rc0 and ran this workflow:
from flytekit import task, workflow
from flytekitplugins.kfpytorch import Elastic


@task(
    task_config=Elastic(nnodes=2)
)
def test_cleanup():
    raise Exception("Non recoverable")


@workflow
def wf():
    test_cleanup()
The pod entry point catches the error so the pods are Completed, not failed:
NAME                                 READY   STATUS      RESTARTS   AGE
f3b52c20cc11c4c6496b-n0-0-worker-0   0/1     Completed   0          46s
f3b52c20cc11c4c6496b-n0-0-worker-1   0/1     Completed   0          46s
The completed pods stick around for a while, e.g. until the respective node gets deleted, but they are not consuming any resources. I would say this is the expected behaviour. If you want to clean up completed pods, I’d recommend configuring flytepropeller to delete the respective k8s resource when an attempt completes:
configmap:
  k8s:
    plugins:
      k8s:
        delete-resource-on-finalize: true
With this, the PyTorchJob is immediately deleted, which causes the training operator to delete the pods as well.

I’m now wondering whether my example above is too simple to cause the behaviour you describe. Could you please try what happens for you? If you have a minimal example that I could try, I’m happy to do that.
🙏 1
f
@cool-lifeguard-49380 In my case, we have jobs where some ranks sometimes get stuck, so I have a few pods Completed and a few pods Running. The PyTorchJob itself is marked complete, and since it is in a completed state, the pods are not cleaned up.

Can I ask another favor: in your setup, can you run ‘kubectl get services’? You’ll see that the services are all leaking (without delete-resource-on-finalize set). We had 1000s of such services lying around.

delete-resource-on-finalize is great, I didn’t know about that. But why does Elastic not expose a run policy? I was able to monkey patch things and add a run policy, which makes it work as desired.
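What the monkey patch amounts to is roughly the following sketch. At the time of this thread, Elastic did not accept a run_policy argument, so the parameter here is the proposed addition rather than the existing API; names are assumed to mirror the PyTorch task config:

from flytekit import task
from flytekitplugins.kfpytorch import CleanPodPolicy, Elastic, RunPolicy


# Sketch of the proposed change: Elastic accepting a run_policy like PyTorch does.
# run_policy on Elastic is the assumption being discussed, not guaranteed to exist
# in the plugin version you have installed.
@task(
    task_config=Elastic(
        nnodes=2,
        run_policy=RunPolicy(clean_pod_policy=CleanPodPolicy.ALL),
    )
)
def train() -> None:
    ...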
f
Do you mean the elastic python client?
c
> Delete resources on finalize is great, didn’t know that. But why does Elastic not expose run policy? I was able to monkey patch things and add a run policy, which makes it work as desired.
Would you be willing to open a PR to expose the run policy in elastic tasks as well?
You could tag me as reviewer 🙏
❤️ 1
f
Sure
🙏 1
c
My gh handle is fg91
f
c
Thank you 🙏 Will review by the end of the week, sorry I can’t promise earlier 😕