# contribute
e
Is anybody looking into integration points that might exist between https://github.com/kubernetes-sigs/kueue and Flyte? GKE at least seems to have plans to integrate with this directly for job queuing (we're in contact with them about capacity issues).
Specifically, does anyone know how Propeller reacts to a pod that has suspend: True specified? As long as it doesn't cause the workflow to be killed, I think they should be able to work fine together.
d
I'm not aware of any potential integrations. We have broadly explored similar queueing solutions in the past though, without any real strong adoption effort. Are you envisioning that propeller, rather than creating a k8s pod to execute a task, writes the task to kueue and waits for it to be picked up? We might need a specific backend plugin for this to correctly identify task phases for jobs in different states of kueue execution. With a maybe naive view of this, it sounds like it wouldn't be terribly difficult.
e
We're going to try just a user-side integration (essentially we use pod-template for relevant pods, since kueue just requires an annotation to know that it should be managing a pod). The only concern there is if propeller does not know how to handle a pod with suspend: True. Longer term I could see having job queues be a little bit more natively integrated as a way of providing priority/resource management, but I don't have a clear sense of what that would look like.
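A rough sketch of what the user-side approach could look like, written as plain Python over a pod-template dict rather than any real Flyte or Kueue API. As far as I can tell, the Kueue plain-pods docs mark pods with a `kueue.x-k8s.io/queue-name` label (a label rather than an annotation). The helper name `with_kueue_queue` and the queue name `user-queue` are made up for illustration.

```python
import copy

# Label key as I understand it from the Kueue plain-pods docs (treat as an assumption).
KUEUE_QUEUE_LABEL = "kueue.x-k8s.io/queue-name"


def with_kueue_queue(pod_template: dict, queue_name: str) -> dict:
    """Return a copy of a pod-template dict with the Kueue queue label set.

    Hypothetical helper: the input is left untouched so the same base
    template can be reused for tasks that should not be queued.
    """
    out = copy.deepcopy(pod_template)
    labels = out.setdefault("metadata", {}).setdefault("labels", {})
    labels[KUEUE_QUEUE_LABEL] = queue_name
    return out


# Example base template; "user-queue" is a placeholder LocalQueue name.
template = {
    "metadata": {"labels": {"team": "ml"}},
    "spec": {"containers": [{"name": "primary"}]},
}
queued = with_kueue_queue(template, "user-queue")
print(queued["metadata"]["labels"])
```

The resulting metadata would then be supplied through whatever pod-template mechanism the deployment already uses.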
d
Oh interesting, so basically you just annotate the pod and that tells kueue that it should be managing it. I assumed (again naively, based on other jobqueue implementations) that it was more complex, i.e. write a job to kueue, wait for it to create a pod, etc. Propeller bases task status off pod / container status. So if kueue just leaves the Pod in pending until it decides to execute, there should be no issue. Very interested to see how this goes and happy to help if you run into issues!
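To make the "status comes from the pod" point concrete, here is a deliberately simplified sketch. It is not Propeller's actual logic, and the phase names on the right-hand side are invented for illustration. It only shows why a pod that sits in Pending reads as "not started yet" rather than as a failure. The `PodScheduled` condition with reason `SchedulingGated` is standard Kubernetes behavior for gated pods.

```python
def task_phase_from_pod(pod: dict) -> str:
    """Map a pod's status to a coarse, hypothetical task phase."""
    status = pod.get("status", {})
    phase = status.get("phase", "Pending")
    if phase == "Pending":
        # A pod held back by a scheduling gate has not failed; it is waiting.
        for cond in status.get("conditions", []):
            if (
                cond.get("type") == "PodScheduled"
                and cond.get("reason") == "SchedulingGated"
            ):
                return "QUEUED"
        return "INITIALIZING"
    # Remaining Kubernetes pod phases map one-to-one in this sketch.
    mapping = {"Running": "RUNNING", "Succeeded": "SUCCEEDED", "Failed": "FAILED"}
    return mapping.get(phase, "UNKNOWN")


gated_pod = {
    "status": {
        "phase": "Pending",
        "conditions": [
            {"type": "PodScheduled", "status": "False", "reason": "SchedulingGated"}
        ],
    }
}
print(task_phase_from_pod(gated_pod))
```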
e
Yeah, AFAICT Kueue just watches pods with the annotation and puts them in a queue; then, when they receive resources, it just sets suspend: False.
Ahh, unfortunately it appears that only Job resources have the suspend field... So Flyte would need to change tasks to run as jobs and not pods.... no idea how involved that would be.
Correction: Looks like it will work just fine. Kueue just leaves the pod in pending, provided you enable managing bare pods: https://kueue.sigs.k8s.io/docs/tasks/run_plain_pods/.
We'll do a little write-up if we get it working.
d
Sounds good! Looking forward to hearing back on this.
k
Ya jobs are terrible - prefer pods please
d
Re "We'll do a little write-up if we get it working": hey @Eli Bixby, let us know. We can collaborate on this and post it on Flyte's blog too.