# flyte-support
f
Hi. I am looking at flytekitplugins/kfpytorch/task.py, and I see that when using the Elastic plugin, a clean_pod_policy is not being set. This results in resources not being cleaned up automatically. The same setting is available for PyTorch tasks. Am I missing something?
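For reference, a non-elastic PyTorch task can set a clean pod policy roughly like this (a sketch assuming a flytekitplugins-kfpytorch release that exposes RunPolicy and CleanPodPolicy; exact parameter names may differ between versions):

from flytekit import task
from flytekitplugins.kfpytorch import CleanPodPolicy, PyTorch, RunPolicy


# Sketch only: assumes this plugin version accepts run_policy on the PyTorch config.
@task(
    task_config=PyTorch(
        num_workers=2,
        run_policy=RunPolicy(clean_pod_policy=CleanPodPolicy.ALL),
    )
)
def train() -> None:
    ...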
f
hmm we have seen it being cleaned up
cc @cool-lifeguard-49380 / @broad-monitor-993
b
Don’t know anything about this, Fabio may have more context
f
When we run elastic we don’t see leaks, right? Maybe this is new
c
We don’t see any, but which version of the training operator do you run?
f
We are using the v1.8.0-rc.0 release
This is because the earlier stable version does not support sidecars
Btw, I think this might be because they changed the defaults
Let me double check that
For PyTorch jobs, it seems like it was always None
@cool-lifeguard-49380 I don't see the clean pod policy being set by the flyte plugin when using Elastic. And in the latest rc version of the training operator, it defaults to None.
I am looking at the latest stable release here: https://github.com/kubeflow/training-operator/blob/v1.7-branch/pkg/apis/kubeflow.org/v1/pytorch_defaults.go The clean pod policy is set to None there as well.
c
Will do some experiments tonight and get back to you 🙂
I installed kubeflow training operator 1.8.0-rc0 and ran this workflow:
from flytekit import task, workflow
from flytekitplugins.kfpytorch import Elastic


@task(
    task_config=Elastic(nnodes=2)
)
def test_cleanup():
    raise Exception("Non recoverable")


@workflow
def wf():
    test_cleanup()
The pod entry point catches the error so the pods are Completed, not failed:
NAME                                 READY   STATUS      RESTARTS   AGE
f3b52c20cc11c4c6496b-n0-0-worker-0   0/1     Completed   0          46s
f3b52c20cc11c4c6496b-n0-0-worker-1   0/1     Completed   0          46s
The completed pods stick around for a while, e.g. until the respective node gets deleted, but they are not consuming any resources. I would say this is the expected behaviour. If you want to clean up completed pods, I’d recommend configuring flytepropeller to delete the respective k8s resource when an attempt completes:
configmap:
  k8s:
    plugins:
      k8s:
        delete-resource-on-finalize: true
With this, the PyTorchJob is immediately deleted, which causes the training operator to delete the pods as well.

I’m now wondering whether my example above is too simple to cause the behaviour you describe. Could you please try what happens for you? If you have a minimal example that I could try, I’m happy to do that.
🙏 1
f
@cool-lifeguard-49380 In my case, we have jobs where some ranks sometimes get stuck, so I have a few pods Completed and a few pods Running. The PyTorchJob itself is marked complete, and since it is in a completed state, the pods are not cleaned up.

Can I ask another favor: in your setup, can you run ‘kubectl get services’? You’ll see that the services are all leaking (without delete-resource-on-finalize set). We had 1000s of such services lying around.

delete-resource-on-finalize is great, I didn’t know about that. But why does Elastic not expose a run policy? I was able to monkey patch things and add a run policy, which makes it work as desired.
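What the monkey patch amounts to is roughly the following sketch. At the time of this thread, Elastic did not accept a run_policy argument, so the parameter here is the proposed addition rather than the existing API; names are assumed to mirror the PyTorch task config:

from flytekit import task
from flytekitplugins.kfpytorch import CleanPodPolicy, Elastic, RunPolicy


# Sketch of the proposed change: Elastic accepting a run_policy like PyTorch does.
# run_policy on Elastic is the assumption being discussed, not guaranteed to exist
# in the plugin version you have installed.
@task(
    task_config=Elastic(
        nnodes=2,
        run_policy=RunPolicy(clean_pod_policy=CleanPodPolicy.ALL),
    )
)
def train() -> None:
    ...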
f
Do you mean the elastic python client?
c
> Delete resources on finalize is great, didn’t know that. But why does Elastic not expose run policy? I was able to monkey patch things and add a run policy, which makes it work as desired.
Would you be willing to open a PR to expose the run policy in elastic tasks as well?
You could tag me as reviewer 🙏
❤️ 1
f
Sure
🙏 1
c
My gh handle is fg91
f
c
Thank you 🙏 Will review by the end of the week, sorry I can’t promise earlier 😕