cool-lifeguard-49380
06/21/2024, 9:08 PM
1.8.0-rc0 and ran this workflow:
from flytekit import task, workflow
from flytekitplugins.kfpytorch import Elastic


@task(
    task_config=Elastic(nnodes=2)
)
def test_cleanup():
    raise Exception("Non recoverable")


@workflow
def wf():
    test_cleanup()
The pod entry point catches the error, so the pods end up Completed, not Failed:
NAME                                 READY   STATUS      RESTARTS   AGE
f3b52c20cc11c4c6496b-n0-0-worker-0   0/1     Completed   0          46s
f3b52c20cc11c4c6496b-n0-0-worker-1   0/1     Completed   0          46s
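For intuition on why STATUS shows Completed: Kubernetes derives the pod phase from the container's exit code, so an entry point that catches the task exception and exits 0 yields a Completed pod even though the task raised. A minimal sketch of that idea (not Flyte's actual entry point; the error reporting below is purely illustrative):

# Sketch only: shows why a caught exception still produces exit code 0 and
# therefore a Completed pod. Flyte's real entry point reports the failure
# through its own error-handling path rather than just printing it.
import sys


def run_task():
    # Stands in for the user task body from the example above.
    raise Exception("Non recoverable")


def entrypoint():
    try:
        run_task()
    except Exception as exc:
        # The error is recorded out-of-band instead of being re-raised,
        # so the container process still terminates successfully.
        print(f"task failed: {exc}", file=sys.stderr)
    sys.exit(0)  # exit code 0 -> pod phase Succeeded -> STATUS Completed


if __name__ == "__main__":
    entrypoint()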
The completed pods stick around for a while, e.g. until the respective node gets deleted, but they don't consume any resources.
I would say this is the expected behaviour.
If you want to clean up completed pods, I’d recommend configuring flytepropeller to delete the respective k8s resource when an attempt completes:
configmap:
  k8s:
    plugins:
      k8s:
        delete-resource-on-finalize: true
With this, the PyTorchJob is immediately deleted, which causes the training operator to delete the pods as well.
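One way to verify the cleanup is to list the PyTorchJob custom resources after the execution finishes and confirm the CR is gone. A sketch, assuming the kubernetes Python client, kubeconfig access, and a flytesnacks-development namespace (the namespace is my assumption, not from the thread):

# Sketch, assuming the `kubernetes` Python client and kubeconfig access;
# replace the namespace with your execution's project-domain namespace.
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

jobs = api.list_namespaced_custom_object(
    group="kubeflow.org",
    version="v1",
    namespace="flytesnacks-development",  # assumed example namespace
    plural="pytorchjobs",
)
# With delete-resource-on-finalize enabled, this should print an empty
# list shortly after the attempt completes.
print([item["metadata"]["name"] for item in jobs.get("items", [])])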
---
I’m now wondering whether my example above is too simple to cause the behaviour you describe. Could you please try what happens for you? If you have a minimal example that I could try, I’m happy to do that.