# ask-the-community
j
Hello there, we have an issue and we were wondering if there is a way to work around it. Here is the situation:
• We have a namespace with a 3GB memory limit and a default task memory request of 1GB.
• We ran 3 workflows whose first task failed with the following error:
[1/1] currentAttempt done. Last Error: USER::task execution timeout [5m0s] expired
The reason is that the pod was trying to mount a secret volume that didn't exist (there was a typo in the secret name).
• The problem is that the K8s deployments of those tasks were still there after those 5 minutes, and stayed for hours, keeping the 3GB for themselves and making the other tasks wait.
• Ultimately, after those deployments were removed, the other tasks were picked up.
I would expect Flyte to terminate the deployments right after the first error, freeing the resources for the other tasks, no? Any clue?
k
Yup, this can be enabled: immediate termination after failure. The default is to keep the state to help with debugging and logs.
Will share the config cc @Dan Rammer (hamersaw)
So to understand: in most cases the pod will not use resources unless it is in a back-off error, etc. K8s does not do a great job of surfacing these errors. Can you share details of the error you saw, so we can improve this edge-case handling for everyone?
Can you share the pod status yaml
In the case where it was stuck
j
I don’t have it anymore but those were the events
Unable to attach or mount volumes: unmounted volumes=[onxg542gnrqwwzk6], unattached volumes=[kube-api-access-8cmfw onxg542gnrqwwzk6 aws-iam-token]: timed out waiting for the condition
MountVolume.SetUp failed for volume "onxg542gnrqwwzk6" : references non-existent secret key: password
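For context, this event shows up when a pod spec references a secret key that doesn't exist. A minimal sketch of the kind of manifest that triggers it (the names below are made up for illustration, not taken from the actual workflow):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-task-pod
spec:
  containers:
    - name: task
      image: busybox
      volumeMounts:
        - name: my-secret-vol
          mountPath: /etc/creds
  volumes:
    - name: my-secret-vol
      secret:
        secretName: my-secret
        items:
          # If "my-secret" has no key named "password" (e.g. a typo),
          # the kubelet retries the mount until it times out with
          # "references non-existent secret key".
          - key: password
            path: password
```

The pod stays in `ContainerCreating` while the kubelet retries, which is why the resources remained reserved.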
k
So the timeout is initiated by Flyte. But ideally we could have failed earlier.
I'll try to replicate this problem and fail earlier
Anyways, will share the config once near a computer
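A sketch of the configuration in question, assuming it lives under the FlytePropeller `plugins.k8s` block (the option name matches the one mentioned later in this thread; verify the exact placement against your Flyte version's docs):

```yaml
plugins:
  k8s:
    # When true, FlytePropeller deletes the underlying k8s resource
    # when a task is finalized, instead of leaving it around for
    # debugging, which frees the namespace's resource quota sooner.
    delete-resource-on-finalize: true
```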
j
🙏
k
cc @Smriti Satyan add to docs?
s
Yes Ketan
d
@Ketan (kumare3) I have a PR out to fix this issue. Currently, we call "Finalize" during a permanent failure rather than "Abort". In some circumstances this can leave resources executing even though Flyte has moved on. Using the `delete-resource-on-finalize` configuration works, but I think it would be nice if that option worked orthogonally. You shouldn't need to update configuration to fix this, right?
❤️ 1
j
Thanks! 🙂
👍 2