rapid-artist-48509
04/04/2025, 11:38 PM

rapid-artist-48509
04/04/2025, 11:41 PM
First of all, `projectQuotaGpu` doesn't appear to exist in the flyte GitHub repo.
Second of all, there's a k8s way to set resource quotas for a namespace, but this just makes k8s push back rather than the flyte propeller: https://kubernetes.io/docs/concepts/policy/resource-quotas/
(I want the flyte propeller to push back, because I might have multiple projects and because I want flyte to properly show a job as queued vs. running, i.e. dequeued to k8s.)
Third, there's this old thread: https://discuss.flyte.org/t/13550344/hello-nice-to-meet-you-all-slightly-smiling-face-i-was-looki#a1644d79-90d6-4cbf-ac73-aaf76e3c2b40 But again, that seems to use the non-existent symbol `projectQuotaGpu`.
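For reference, the k8s-native route from that link is just a namespaced ResourceQuota, roughly like the sketch below (the namespace name and the numbers are illustrative, not taken from a real setup). It only makes the kube scheduler hold the Pods back; Flyte itself still reports the execution as "Running".

```yaml
# Plain Kubernetes ResourceQuota, per the k8s docs linked above.
# Illustrative only: the namespace assumes the usual <project>-<domain>
# naming and the numbers are placeholders.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: project-quota
  namespace: flytesnacks-development
spec:
  hard:
    limits.cpu: "10"
    limits.memory: 64Gi
    requests.nvidia.com/gpu: "4"   # extended resources are quota'd via the requests.* prefix
```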
rapid-artist-48509
04/05/2025, 12:01 AM
I tried `projectQuotaCpu`, and the flyte WebUI put all the executions in the "Running" state, though I can see the pods are obviously "Pending". Does this all only work in the non-single-binary install of Flyte?
rapid-artist-48509
04/05/2025, 12:05 AM
… `map_task`. But here I need the workflow executions to show "queued" properly.
rapid-artist-48509
04/05/2025, 12:29 AM
What units is `projectQuotaCpu` in? I would think whole CPUs, but when I edit this around it seems it might actually be millis?
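Assuming the value gets substituted straight into a ResourceQuota field (as in the template sketches further down), it should be a standard Kubernetes CPU quantity, so either whole cores or millicores are accepted; for example:

```yaml
# Kubernetes CPU quantities (illustrative): a bare number means whole cores,
# an "m" suffix means millicores.
spec:
  hard:
    limits.cpu: "10"      # 10 whole CPUs
    # limits.cpu: "9500m" # same field expressed in millicores (9.5 CPUs)
```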
rapid-artist-48509
04/07/2025, 3:36 PM

average-finland-92144
04/07/2025, 9:02 PM
The names of those attributes (like `projectQuotaCpu`) are arbitrary, so you could define projectQuotaGPU
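A sketch of how that fits together, loosely following the custom Kubernetes resources example in the Flyte docs. `projectQuotaGpu` is an invented variable name here, and exactly where these pieces live (files under flyteadmin's templatePath vs. Helm chart values) depends on the deployment:

```yaml
# Part 1 - a cluster resource template rendered by flyteadmin's cluster
# resource controller into each project-domain namespace. Any {{ variable }}
# name works as long as a value is supplied for it.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: project-quota
  namespace: '{{ namespace }}'
spec:
  hard:
    limits.cpu: '{{ projectQuotaCpu }}'
    requests.nvidia.com/gpu: '{{ projectQuotaGpu }}'   # assumed, user-defined variable
---
# Part 2 - per-domain defaults in flyteadmin's cluster_resources config
# (overridable per project/domain via flytectl, as in the attrFile example
# further down).
cluster_resources:
  customData:
    - development:
        - projectQuotaCpu:
            value: "10"
        - projectQuotaGpu:
            value: "4"
```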
average-finland-92144
04/07/2025, 9:03 PM

rapid-artist-48509
04/07/2025, 11:15 PM
I tried the `flytectl update cluster-resource-attribute --attrFile cra.yaml` way and:
• I couldn't see how to spec GPUs ... maybe it's just the nvidia gpu tag?
• I believe I was using k8s units as expected, but yeah, things didn't work.
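For what it's worth, the attrFile format from the Flyte docs looks roughly like the sketch below; the GPU attribute is an assumption and only does anything if a cluster resource template references a matching `{{ projectQuotaGpu }}` variable:

```yaml
# cra.yaml - cluster resource attributes for one project/domain,
# applied with: flytectl update cluster-resource-attribute --attrFile cra.yaml
# projectQuotaCpu / projectQuotaMemory appear in the Flyte docs;
# projectQuotaGpu is a user-defined name that must match the template.
attributes:
  projectQuotaCpu: "10"
  projectQuotaMemory: "64Gi"
  projectQuotaGpu: "4"
project: flytesnacks
domain: development
```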
rapid-artist-48509
04/08/2025, 5:17 PM
I set up a `projectQuotaCpu` of 10 and tasks / workflows that each request / limit 9 CPUs. The workflows do NOT queue as expected; they go straight to K8s, where they hang around in NotReady for a while and then eventually get run.
Does all this NOT work for the flyte-single-binary deployment? Also, I see no code in the flyte repo for project GPU quotas, so is this not an OSS feature?
average-finland-92144
04/08/2025, 10:24 PM

rapid-artist-48509
04/09/2025, 5:05 PM
Users kick off a lot of workflow executions with `pyflyte run`, so many that all the resources get used and workflows have to queue up. Right now, the Flyte WebUI always shows these queued executions as "Running" even though they are effectively queued by K8s (and the pods are in the "Pending" state). This behavior is confusing to users, as they think all their jobs have started and yet clearly they are stuck in line. So my questions, I guess, are:
• What can I do, if anything, to get the workflows to properly display as "queued" in the executions view? I guess I thought that specifying quotas would help, but is there any other way? I think Ketan said that once a workflow / task gets sent to k8s, it's always "Running" according to Flyte. So I was trying to see how to keep workflows from getting dispatched without respect to resources.
• If there really isn't any facility for the above bullet point, would perhaps my only avenue be, e.g., a feature request where workflows can report the actual pod state, i.e. "Pending"?
At the end of the day, a use case is like: "the user kicked off 100 workflow runs at night; in the morning 90 were done or running, and the user wants to decide if the other 10 should just be terminated. But it's unclear if the other 10 ever started."
average-finland-92144
04/09/2025, 10:56 PM
Could you check the `Timeline` tab of the UI for one of those "Running" tasks?
When you hover over the execution bar it displays phases.
Also, this dashboard includes a metric for Tasks whose Pod is in the Pending state in K8s:
https://grafana.com/grafana/dashboards/22146-flyte-user-dashboard-via-prometheus/
average-finland-92144
04/09/2025, 10:57 PM

rapid-artist-48509
04/11/2025, 1:50 AM
`TASK_RUNTIME` in the Timeline view is consistently showing something like "-1hr" for a task that took about 1-2 sec to run, and perhaps 1-3 minutes of full wallclock time (e.g. slow image pull).
I like the idea of Grafana for the cluster, but I was really hoping the Flyte WebUI could just show whether some execution is "later in the queue".
Thanks for the "fail fast" link, but yeah, we definitely want to have a long queue of stuff to compute, like a CI system.
Thanks again for all the pointers!! I guess none of this would change if we switch from the single-binary / flyte-binary helm chart to the multi-chart deployment?
rapid-artist-48509
04/21/2025, 4:31 PM
First: could an execution be shown as `Pending` if any or all of the containers / pods associated with the execution are effectively queued?
Secondly: is there some configuration of Flyte that would exercise the case of several executions getting queued / held up by the propeller for several minutes before being sent to k8s, just to test that behavior? (Thus far I've never seen this happen except internally in `map_task`.) ...