# flyte-support
m
Hi all, I have a general Flyte/K8s scheduling question that I'd love some feedback on. I have a task decorated like so:
@task(requests=Resources(cpu="48", mem="24Gi"), limits=Resources(cpu="64", mem="64Gi"))
Normally I use limits to avoid leaving available resources unused -- above I think of this as saying "I definitely need 48 cores to get my job done, but I can use up to 64." But in practice, because scheduling only looks at requests, apparently without regard for actual node utilization, this is problematic. My task above got scheduled onto a node with 64 cores, and even though my task has all 64 cores pegged at 100%, other tasks are getting scheduled onto the same node (and failing to start, waiting in the "initializing" state). K8s thinks there is room on the node, since I only requested 48 cores and the node has 64. In fact, even "system-level" pods seem to have difficulty running well because my task is pegging all 64 cores at 100%, and no capacity appears to have been reserved for other admin-type pods on the node.

In case it's not obvious: I'm fairly new to K8s scheduling, and I'm not the person who set up the cluster or manages the scheduling configuration (we use Karpenter for some of this), but I'm trying to understand this myself so I can be useful to my team in solving it. I've done some googling but haven't gained much insight beyond the link posted above. Thanks! Thomas
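A minimal sketch of the request-equals-limit variant (task name and body are placeholders, and in practice a little less than the full node is allocatable because of system-reserved capacity): with requests raised to match what the task actually consumes, the scheduler accounts for all 64 cores and won't pack other Pods onto the "leftover" 16.

from flytekit import task, Resources

# Sketch only: requesting the CPU the task really uses means the scheduler
# treats the node as full. Name and body are placeholders, not from the thread.
@task(requests=Resources(cpu="64", mem="64Gi"), limits=Resources(cpu="64", mem="64Gi"))
def cpu_heavy_task() -> None:
    ...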
a
Hi Thomas! Well, scheduling is one of the most interesting aspects of K8s IMO. TL;DR: the K8s scheduler uses a scoring mechanism to determine which node to schedule a Pod onto. By default it accounts for allocation (the sum of Pod requests), not utilization -- probably the reason you're seeing Pods scheduled onto a node with high CPU utilization. The reasoning is that resource utilization is a very spurious metric for workloads like Pods. There are ways to tweak the scheduler's scoring, for example to change how it weighs a node's already-requested resources (see https://kubernetes.io/docs/concepts/scheduling-eviction/resource-bin-packing/). In fact, the entire Scheduling, Preemption and Eviction section of the docs is a great resource. I hope this is helpful!
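To make the "allocation, not utilization" point concrete, here is a toy Python sketch (not the real scheduler code) of a LeastAllocated-style score, using the numbers from this thread as assumptions:

def least_allocated_score(requested_cpu: float, allocatable_cpu: float) -> float:
    # Emptier nodes (by *requests*) score higher; live utilization never appears.
    return (allocatable_cpu - requested_cpu) * 100 / allocatable_cpu

# A 64-core node already carrying the 48-core request: the scheduler still sees
# 16 "free" cores and gives the node a non-zero score, even if all 64 cores are
# pegged at 100%.
print(least_allocated_score(requested_cpu=48, allocatable_cpu=64))  # 25.0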
q
Also note that in many cases setting CPU limits (not requests) results in unused CPU due to throttling; see also https://home.robusta.dev/blog/stop-using-cpu-limits. But the CPU "shares" (the slices of CPU time each pod gets) are assigned/weighted based on CPU requests, so you should not be able to completely starve other pods or system services.
gratitude thank you 1
🙌🏽 1
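A rough illustration of the "shares are weighted by requests" point above -- the mapping approximates cgroup v1 cpu.shares (cgroup v2 uses cpu.weight instead, but the proportional idea is the same), and the system pod's request is an assumption:

def cpu_shares(request_millicpu: int) -> int:
    # Kubernetes maps CPU requests to cgroup v1 shares as roughly millicpu * 1024 / 1000.
    return request_millicpu * 1024 // 1000

big_task = cpu_shares(48_000)    # the 48-core Flyte task
system_pod = cpu_shares(1_000)   # a hypothetical 1-core system pod request

# Under contention, CFS divides CPU time in proportion to shares, so the system
# pod is still guaranteed roughly its requested slice of the node:
print(system_pod / (big_task + system_pod))  # ~0.02, i.e. about 1.3 of 64 cores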
m
@average-finland-92144 @quaint-diamond-37493 thanks both for your feedback.

@average-finland-92144 I understand node utilization is a spurious metric, so it makes sense to consider not just utilization but max(sum(pod-requests), actual-node-utilization) -- in that case, if a node reported 100% utilization, even "spuriously", it would be down-voted as a target node for a task.

@quaint-diamond-37493 I read that post and the related links. I think not using CPU limits is probably the right answer from various perspectives, but I'm not sure it would have helped in the case I witnessed. To use the water-bottle analogy from that post, my story is: Marcus is alone with his water bottle. He is very thirsty, so whether his limit was >3 liters or there was no limit at all, either way he gets to drink all (3L of) the water. Theresa (a newly scheduled pod) then shows up wanting some water. But it's all gone!

I'm not very clear about what happens when that new pod shows up and another pod is already using all the cores. Since the busy pod is using more than its request (64 cores vs. 48), presumably some could be "taken away" by scheduling its 64 processes across 48 cores, slowing the computation but freeing up 16 cores for other tasks. But this did not happen. And it's not clear whether the behavior would have been different absent a CPU limit -- in that case too it would have been allowed to consume all 64 cores, since initially it was alone on the node and they were "spare".
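A toy sketch of the max(sum(pod-requests), actual-node-utilization) idea above -- this is not what the default scheduler does, just an illustration of the proposed heuristic with this thread's numbers:

def adjusted_cpu_allocation(requested: float, utilized: float, allocatable: float) -> float:
    # Treat the node as "as full as" the larger of its requested and observed CPU.
    return max(requested, utilized) / allocatable

# 48 cores requested, 64 observed in use, 64 allocatable:
print(adjusted_cpu_allocation(48, 64, 64))  # 1.0 -> node would be down-voted as a target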
q
You are right that a limit equal to the number of cores should be effectively the same as no limit. But when a new pod is scheduled, it will be guaranteed its requested CPU, and the existing one will effectively be granted fewer CPU-cycle "slices". So from my understanding of cgroups, if other pods don't get enough CPU, it's because they rely on their limits (headroom above their requests) and their requests are not set high enough...
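A worked example of that redistribution, under the simplifying assumption that CFS splits the node's CPU time in proportion to requests once there is contention (the new pod's 8-core request is an assumption for illustration):

def cpu_under_contention(my_request: float, other_requests: float, node_cores: float) -> float:
    # Approximation: CPU time is shared in proportion to requests when every pod wants more.
    return node_cores * my_request / (my_request + other_requests)

# Marcus: 48-core request, previously using all 64 cores alone.
# Theresa: new pod requesting 8 cores (assumed).
print(cpu_under_contention(48, 8, 64))  # ~54.9 cores for Marcus once Theresa is busy
print(cpu_under_contention(8, 48, 64))  # ~9.1 cores effectively guaranteed to Theresa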