Hey community - this may be a k8s issue and not related to Flyte, but I'm at a loss at the moment and thought someone in the community may have seen this before...
I'm seeing Flyte tasks/pods "stalling" with tiny CPU consumption, even though the CPU request is high (30 cores) and the node has plenty of spare capacity (according to both k9s and DataDog).
This happens when several of these tasks are scheduled at the same time on the same node, by simultaneous workflows operating on different data. For example, three such tasks (30 cores each) land on a single node with 96 or more cores. One or two of the tasks use the 30 cores they requested and complete in the expected time, but the remaining task (sometimes two of them) never starts using CPU, as if it were being severely throttled. Yet every metric shows the node has CPU to spare (sometimes a little, sometimes a lot).
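One thing I've been doing to check whether the "stalled" pod is actually being CFS-throttled is reading the cgroup `cpu.stat` counters from inside the container. This is just a sketch, assuming a cgroup v2 mount at the usual path (v1 paths differ); the parsing helpers are my own:

```python
from pathlib import Path


def parse_cpu_stat(text: str) -> dict:
    """Parse the contents of a cgroup v2 cpu.stat file into int counters."""
    stats = {}
    for line in text.splitlines():
        key, _, value = line.partition(" ")
        if value:
            stats[key] = int(value)
    return stats


def throttled_fraction(stats: dict) -> float:
    """Fraction of CFS scheduling periods in which the cgroup was throttled."""
    periods = stats.get("nr_periods", 0)
    return stats.get("nr_throttled", 0) / periods if periods else 0.0


if __name__ == "__main__":
    # Assumed cgroup v2 mount point; inside a pod this reflects the
    # container's own CPU limit, not the node's.
    stat_path = Path("/sys/fs/cgroup/cpu.stat")
    if stat_path.exists():
        stats = parse_cpu_stat(stat_path.read_text())
        print(f"throttled fraction: {throttled_fraction(stats):.2%}")
```

If `nr_throttled` stays near zero on the stalled pod, then it's not CFS throttling and the problem is more likely in scheduling or in the workload itself.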
Any one of these tasks run by itself always works, using its full 30 cores. It is only when several are scheduled together that I see this behavior, as if the sudden request for 30 CPUs by each of three pods causes the system to hold off on allocating one or two of them, and then the allocation never happens...? The tasks in question run a classifier over data in parallel batches via a ProcessPoolExecutor from concurrent.futures (standard Python stuff).
It's tempting to blame our parallel implementation, except that it always works in isolation, and how could pods interact this way? But it could well be a situation where a task tries to get some CPU, is denied, and never "asks" again (I'm not clear on how the negotiation for CPU proceeds). Presumably the ProcessPoolExecutor spins up 30 worker processes, and whatever CPUs are allocated get used to schedule those processes -- but the CPU usage never shows up.
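For context, the parallel section of the task is roughly shaped like this (`classify_batch` here is a trivial stand-in for our real classifier, not the actual code; the real work is CPU-bound):

```python
from concurrent.futures import ProcessPoolExecutor


def classify_batch(batch):
    # Stand-in for the real classifier; in production this is the
    # CPU-bound model inference over one batch of records.
    return [x * 2 for x in batch]


def run_batches(batches, workers=30):
    # One worker process per requested core; pool.map returns results
    # in the same order as the input batches.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(classify_batch, batches))
```

So the expectation is 30 worker processes pinned to 30 cores for the duration of the task, which is exactly what the healthy pods show.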
Thanks for any experience/pointers you may have!