Flyte #flyte-support

big-notebook-82371

05/14/2024, 8:09 PM

Hey everyone. We've had an issue recently due to a large drop-off in availability of the main GPUs we were using on GCP, the L4s. We haven't been able to get any tasks that need them to start up. I haven't been able to find a good way to change which machine type to request for a given task without re-registering the workflow. Are there any features that could help dynamically change machine type requests if a task isn't starting due to availability issues? Or what resources are there for trying to do something like that?

freezing-airport-6809

05/14/2024, 8:12 PM

this does not exist as a feature - you will have to re-register. Another option is to change the config on the backend

freezing-airport-6809

05/14/2024, 8:12 PM

just change the node selectors etc - from L4 to, for example, T4

big-notebook-82371

05/14/2024, 8:18 PM

Ok, gotcha. Yeah, we're using node-pool selectors right now. Sounds like we'll probably just have to use smaller GPUs until L4s are more available

freezing-airport-6809

05/14/2024, 8:30 PM

ya just switch in propeller

freezing-airport-6809

05/14/2024, 8:30 PM

much easier

big-notebook-82371

05/14/2024, 8:32 PM

What does that look like? That's a different way from using pod templates?

freezing-airport-6809

05/14/2024, 9:23 PM

hmm there are pod templates, but there is also a gpu switch

freezing-airport-6809

05/14/2024, 9:23 PM

are you using accelerator?

big-notebook-82371

05/14/2024, 9:30 PM

I tried using them, and I remember I had issues getting it to work correctly, so I moved to using

node_selector={"cloud.google.com/gke-nodepool": "<node-pool-name>"},

instead. I wish I remembered why it didn't work, I can't find my note on it right now
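
(Editor's sketch, not from the thread.) One way to make the L4-to-T4 switch less painful is to build the selector in a single place, so a re-registration only needs a changed environment variable rather than edits to every task. The variable `TASK_GPU_TYPE`, the helper `gpu_node_selector`, and the pool names below are all hypothetical; only the `cloud.google.com/gke-nodepool` label key comes from the message above. This does not avoid re-registering, since the selector is assumed to be fixed when the task is registered.

```python
import os

# Hypothetical mapping from a short GPU name to a GKE node pool name.
# The pool names are placeholders, not values from this thread.
GPU_NODE_POOLS = {
    "l4": "<l4-node-pool-name>",
    "t4": "<t4-node-pool-name>",
}


def gpu_node_selector(default_gpu="l4"):
    """Build a node selector for the GPU named in TASK_GPU_TYPE.

    TASK_GPU_TYPE is an assumed environment variable, read wherever
    the workflow is registered. Falls back to default_gpu when unset.
    """
    gpu = os.environ.get("TASK_GPU_TYPE", default_gpu).lower()
    if gpu not in GPU_NODE_POOLS:
        raise ValueError(f"unknown GPU type: {gpu!r}")
    return {"cloud.google.com/gke-nodepool": GPU_NODE_POOLS[gpu]}


# The resulting dict would be passed wherever the task currently sets
# node_selector={...}, e.g. node_selector=gpu_node_selector().
```

With something like this, moving to smaller GPUs would mean setting `TASK_GPU_TYPE=t4` and re-registering, then unsetting it once L4s are available again.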
