# ask-the-community
j
Hello there, we have an issue and we were wondering if there is a way to work around it. Here is the situation:
• We have a namespace with a 3GB memory limit and a default task memory request of 1GB.
• We ran 3 workflows whose first task failed with the following error:
[1/1] currentAttempt done. Last Error: USER::task execution timeout [5m0s] expired
The reason is that the pod was trying to mount a secret volume that didn't exist (there was a typo in the secret name).
• The problem is that the K8s deployments of those tasks were still there after those 5 minutes, and stayed for hours, keeping the 3GB for themselves and making the other tasks wait.
• Ultimately, after those deployments were removed, the other tasks were picked up.
I would expect Flyte to terminate the deployments right after the first error, freeing the resources for the other tasks, no? Any clue?
k
Yup, this can be enabled: immediate termination after failure. The default is to keep the state to help with debugging and logs.
Will share the config cc @Dan Rammer (hamersaw)
So to understand: in most cases the pod will not use resources unless it is in a back-off error, etc. K8s does not do a great job of surfacing these errors. Can you share details of the error you saw, so we can improve this edge-case handling for everyone?
Can you share the pod status yaml
In the case where it was stuck
j
I don’t have it anymore but those were the events
Unable to attach or mount volumes: unmounted volumes=[onxg542gnrqwwzk6], unattached volumes=[kube-api-access-8cmfw onxg542gnrqwwzk6 aws-iam-token]: timed out waiting for the condition
MountVolume.SetUp failed for volume "onxg542gnrqwwzk6" : references non-existent secret key: password
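For context, this event shows up when a pod spec references a secret key that doesn't exist. A minimal sketch of the kind of manifest that triggers it (the names below are made up for illustration, not taken from the actual workflow):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-task-pod
spec:
  containers:
    - name: task
      image: busybox
      volumeMounts:
        - name: my-secret-vol
          mountPath: /etc/creds
  volumes:
    - name: my-secret-vol
      secret:
        secretName: my-secret
        items:
          # If "my-secret" has no key named "password" (e.g. a typo),
          # the kubelet retries the mount until it times out with
          # "references non-existent secret key".
          - key: password
            path: password
```

The pod stays in `ContainerCreating` while the kubelet retries, which is why the resources remained reserved.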
k
So the timeout is initiated by Flyte. But ideally we could have failed earlier.
I'll try to replicate this problem and fail earlier
Anyways, will share the config once near a computer
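A sketch of the configuration in question, assuming it lives under the FlytePropeller `plugins.k8s` block (the option name matches the one mentioned later in this thread; verify the exact placement against your Flyte version's docs):

```yaml
plugins:
  k8s:
    # When true, FlytePropeller deletes the underlying k8s resource
    # when a task is finalized, instead of leaving it around for
    # debugging, which frees the namespace's resource quota sooner.
    delete-resource-on-finalize: true
```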
j
🙏
k
cc @Smriti Satyan add to docs?
s
Yes Ketan
d
@Ketan (kumare3) I have a PR out to fix this issue. Currently, we call "Finalize" during a permanent failure rather than "Abort". In some circumstances this can leave resources executing even though Flyte has moved on. Using the `delete-resource-on-finalize` configuration works, but I think it would be nice if that option worked orthogonally. You shouldn't need to update configuration to fix this, right?
❤️ 1
j
Thanks! 🙂
👍 2