Flyte enables production-grade orchestration for machine learning workflows and data processing created to accelerate local workflows to production.

Hello all, my Flyte workflows are running on a k8s cluster. The workflow has 6 nodes and each node requests 1 CPU. By the end of the workflow, the 6 nodes are requesting 6 CPUs in total. The workflow succeeds, but the CPU requests are not released. This means that once I run this workflow about 20 times, my cluster is already at 120 CPUs requested, and after that the jobs get OOMKilled.

Has anyone run into this? How do I get out of this pickle?

Do you mean that the pods aren't terminating? Sometimes that happens to us too, and we're not 100% sure why either.

Yes, the workflow node pods just say Succeeded, and the CPU request consumption comes down only when we delete the workflow from k8s.

If so, then you will need to enable a flag.
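
For anyone trying to confirm the symptom described above, here is a minimal sketch (not part of Flyte) that totals CPU requests per pod phase from a pod list shaped like the output of `kubectl get pods -o json`. The helper names `parse_cpu` and `cpu_requests_by_phase` and the sample data are hypothetical, made up for illustration; only the `m` (millicore) suffix and plain decimal CPU quantities are handled.

```python
from collections import defaultdict


def parse_cpu(quantity: str) -> float:
    """Convert a Kubernetes CPU quantity such as "1", "0.5" or "500m" to cores."""
    if quantity.endswith("m"):
        # "500m" means 500 millicores, i.e. 0.5 cores.
        return int(quantity[:-1]) / 1000
    return float(quantity)


def cpu_requests_by_phase(pod_list: dict) -> dict:
    """Sum container CPU requests per pod phase (Running, Succeeded, ...)."""
    totals = defaultdict(float)
    for pod in pod_list.get("items", []):
        phase = pod.get("status", {}).get("phase", "Unknown")
        for container in pod.get("spec", {}).get("containers", []):
            requests = container.get("resources", {}).get("requests", {})
            # Containers without a CPU request contribute nothing.
            totals[phase] += parse_cpu(requests.get("cpu", "0"))
    return dict(totals)


def make_pod(phase: str, cpu: str) -> dict:
    """Build a made-up pod entry with one container, for the example only."""
    return {
        "status": {"phase": phase},
        "spec": {"containers": [{"resources": {"requests": {"cpu": cpu}}}]},
    }


# Hypothetical sample: one finished 6-node workflow plus one running pod.
sample = {"items": [make_pod("Succeeded", "1") for _ in range(6)]
          + [make_pod("Running", "500m")]}

print(cpu_requests_by_phase(sample))
```

To run it against a real cluster, the `sample` dict could be replaced with `json.load(sys.stdin)` and fed from `kubectl get pods -n <namespace> -o json`. A large total under `Succeeded` would match what is described in this thread: pods that have finished but still exist as objects until the workflow is deleted.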