Thomas Blom

05/04/2023, 5:35 PM
Hi all, I have a general Flyte/K8s scheduling question that I'd love some feedback on. I have a task decorated like so:
@task(requests=Resources(cpu="48", mem="24Gi"), limits=Resources(cpu="64", mem="64Gi"))
Normally I use limits to not leave available resources unused -- above I think of this as saying "I definitely need 48 cores to get my job done, but I can use up to 64". But in practice, because scheduling only looks at requests, apparently without regard for actual node utilization, this is problematic. My task above got scheduled onto a node with 64 cores, and even though my task has all 64 cores pegged at 100%, other tasks are getting scheduled onto the same node (and failing to start, waiting in the "initializing" state). K8s thinks there is room on the node, since I only requested 48 cores and the node has 64. In fact, even "system level" pods seem to have difficulty running well because my task is pegging all 64 cores at 100%, and no capacity appears to have been reserved for other admin-type pods on the node. In case it's not obvious, I'm fairly new to K8s scheduling, and I'm not the person who set up the cluster or manages the scheduling configuration (we use Karpenter for some of this), but I'm trying to understand this myself so I can be useful to my team in solving it. I've done some googling but haven't gained much insight beyond the link posted above. Thanks! Thomas
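The situation described above can be sketched as a small model. This is only an illustration of the idea that the fit check looks at CPU requests rather than limits or live utilization; the function name `fits` and the assumption that all 64 cores are allocatable are hypothetical, and the real scheduler considers more than CPU.

```python
# Simplified model: the fit check compares summed CPU *requests* against the
# node's allocatable CPU. Limits and actual utilization are not inputs.

def fits(node_allocatable_cpu, scheduled_requests, new_request):
    """Return True if the new pod's CPU request fits in the unrequested capacity."""
    free = node_allocatable_cpu - sum(scheduled_requests)
    return new_request <= free

node_cpu = 64.0          # hypothetical: assumes all 64 cores are allocatable
big_task_request = 48.0  # requests=Resources(cpu="48"); the limit of 64 plays no role here

# A pod requesting 8 cores still "fits", even if the big task is using all 64.
print(fits(node_cpu, [big_task_request], 8.0))
# A pod requesting 32 cores does not fit, because only 16 cores are unrequested.
print(fits(node_cpu, [big_task_request], 32.0))
```

Under this model, raising the request closer to what the task really uses is what keeps other pods off the node.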

David Espejo (he/him)

05/04/2023, 6:00 PM
Hi Thomas! Well, scheduling is one of the most interesting aspects of K8s IMO. TL;DR the K8s scheduler uses a scoring mechanism to determine on which node to schedule a Pod. By default it accounts for allocation, not utilization (probably the reason you're seeing Pods scheduled on a node with high CPU utilization). The reasoning is that resource utilization is a very spurious metric for resources like Pods. There are ways to tweak the scheduler to, for example, assign a higher score to nodes with lower resource utilization. In fact, the entire Scheduling, Preemption and Eviction section of the Kubernetes docs is a great resource. I hope this is helpful.
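As a rough sketch of allocation-based scoring, the snippet below is modeled on the scheduler's "least allocated" strategy, where a node scores higher the less of its capacity is already requested. The function name, the millicore figures, and the two example nodes are assumptions for illustration; the real scheduler combines several plugins and resources.

```python
MAX_NODE_SCORE = 100

def least_allocated_score(allocatable_millicpu, requested_millicpu):
    """Score a node by the fraction of CPU that would remain unrequested.

    requested_millicpu is assumed to include the incoming pod's own request.
    Utilization never appears in the formula, only requests.
    """
    if allocatable_millicpu <= 0 or requested_millicpu > allocatable_millicpu:
        return 0
    free = allocatable_millicpu - requested_millicpu
    return free * MAX_NODE_SCORE // allocatable_millicpu

# Hypothetical 64-core nodes; a new pod requests 8 cores (8000m).
busy_node = least_allocated_score(64000, 48000 + 8000)   # already hosts the 48-core request
empty_node = least_allocated_score(64000, 8000)

print(busy_node, empty_node)
```

The busy node scores lower but is still a valid candidate, so if it is the only node available the pod lands there regardless of how hot its CPUs are running.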

Felix Ruess

05/04/2023, 6:05 PM
Also note that in many cases setting CPU limits (not requests) results in unused CPU due to throttling. But the CPU "shares" (so slices that each pod gets) are assigned/weighted based on CPU requests, so you should not be able to completely starve other pods or system services.
You are right that a limit equal to the number of cores should be effectively the same as no limit. But when a new pod is scheduled, it will be guaranteed its requested CPU, and the existing one will effectively be granted fewer CPU cycle "slices". So from my understanding of cgroups, if other pods don't get enough CPU, it's because they rely on their limits and their request is not high enough...
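To make the "shares" argument concrete, here is a minimal sketch assuming cgroup v1 semantics, where a container's CPU request is converted to `cpu.shares` at 1024 per core and CPU is split proportionally to shares when everything is busy. The pod names and the 8-core request are hypothetical, and the model ignores limits and idle capacity being handed back.

```python
def cpu_shares(request_cores):
    """Convert a CPU request in cores to cgroup v1 cpu.shares (1024 per core, minimum 2)."""
    return max(2, int(request_cores * 1024))

def contended_cpu(node_cores, requests):
    """Split node CPU proportionally to shares, assuming every pod wants as much CPU as it can get."""
    shares = {name: cpu_shares(cores) for name, cores in requests.items()}
    total = sum(shares.values())
    return {name: node_cores * s / total for name, s in shares.items()}

# Hypothetical 64-core node: the 48-core task plus a newly scheduled 8-core pod.
split = contended_cpu(64, {"big-task": 48, "new-pod": 8})
for name, cores in split.items():
    print(f"{name}: {cores:.2f} cores under full contention")
```

In this model both pods receive at least what they requested, with the 8 unrequested cores divided in the same 6:1 ratio. A pod that requested very little CPU gets only a thin slice once the node is saturated, which matches the point about requests being too low.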