# ask-the-community
g
Hello! Nice to meet you all 🙂 I was looking through the documentation and wasn't able to find much information about the following:
• Let's assume I want to create a setup with one Flyte control plane and multiple Flyte data planes, across multiple physical clusters. Does Flyte support quota management for different teams/projects out of the box? How would that work across multiple clusters?
• Is there an existing observability stack for checking resource usage?
Thanks!
v
Hello! Flyte does let you configure resource usage quotas per project; here's an example of one way to achieve this using the flytectl CLI: https://docs.flyte.org/en/latest/deployment/configuration/general.html
The installation guide for multi-cluster deployments is here: https://docs.flyte.org/en/latest/deployment/deployment/multicluster.html
When you register and execute workflows, you have to specify which “project” and “domain” to use. I haven't set up a multi-cluster deployment myself yet, but I'd expect the project-wide quota to still apply regardless of which cluster hosts the executions. The multicluster guide explains how an execution is assigned to a cluster based on labels.
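For reference, specifying the project and domain at registration/execution time might look like this (a sketch using pyflyte; the project, domain, file, and workflow names here are placeholders):

```shell
# Register workflows under a specific project and domain
pyflyte register --project flyteexamples --domain development workflows/
# Run one remotely in that same project/domain
pyflyte run --remote -p flyteexamples -d development workflows/example.py my_workflow
```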
g
Thanks! So basically the control plane would load balance across the data planes, ensuring that the quotas are respected (based on projects), right? What about observability of the occupancy of the clusters?
(I also saw that you can pin projects/workflows to a cluster)
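The label-based placement mentioned above can be sketched as a toy model. This is not Flyte's actual code; the label names, cluster names, and weights are hypothetical, loosely mirroring the label-to-cluster mapping idea from the multicluster guide:

```python
import random

# Hypothetical mapping: each execution label points to a list of
# (data-plane cluster, weight) pairs; executions carrying that label
# are spread across those clusters in proportion to the weights.
LABEL_CLUSTER_MAP = {
    "team-a": [("dataplane-1", 0.7), ("dataplane-2", 0.3)],
    "team-b": [("dataplane-2", 1.0)],  # "pinned": everything goes to one cluster
}

def pick_cluster(label: str, rng: random.Random) -> str:
    """Pick a target cluster for an execution carrying the given label."""
    clusters = LABEL_CLUSTER_MAP[label]
    names = [name for name, _ in clusters]
    weights = [weight for _, weight in clusters]
    return rng.choices(names, weights=weights, k=1)[0]
```

A label mapped to a single cluster with weight 1.0 behaves like the pinning you mentioned. The quota enforcement itself happens separately, via the ResourceQuota in each cluster's project namespace.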
j
re: monitoring, you can use any existing k8s-native infra you have in place to monitor task resource usage (e.g. Stackdriver, CloudWatch, Prometheus, Grafana, Datadog) and/or logs (e.g. Stackdriver, CloudWatch, Datadog, Loki). that part is outside the scope of Flyte. Flyte also exposes Prometheus metrics for its own internals.
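Flyte's internal Prometheus metrics can be scraped like any other pod's. A sketch of a scrape job follows; the namespace, pod label values, and metrics path are assumptions, so check your own deployment's values for the actual port and labels:

```yaml
scrape_configs:
  - job_name: flyte
    metrics_path: /metrics
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names: [flyte]   # assumed install namespace
    relabel_configs:
      # keep only the Flyte control-plane pods (label values assumed)
      - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_name]
        regex: flyteadmin|flytepropeller
        action: keep
```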
g
Thanks @jeev, I was just wondering if Flyte already had something out of the box for monitoring cluster usage
j
Not sure about Flyte-internal cluster occupancy metrics unfortunately, but this can be achieved with kube-state-metrics + Prometheus.
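For occupancy, a couple of PromQL queries over kube-state-metrics give a quick picture (metric names are from kube-state-metrics v2; the GPU resource label assumes the NVIDIA device plugin is in use):

```promql
# Fraction of allocatable CPU currently requested, cluster-wide
sum(kube_pod_container_resource_requests{resource="cpu"})
  / sum(kube_node_status_allocatable{resource="cpu"})

# Same ratio for GPUs (nvidia.com/gpu is sanitized to nvidia_com_gpu)
sum(kube_pod_container_resource_requests{resource="nvidia_com_gpu"})
  / sum(kube_node_status_allocatable{resource="nvidia_com_gpu"})
```

Filtering by namespace (e.g. `namespace=~"myproject-.*"`, with a placeholder project name) would give a per-project view, since Flyte's default template creates one namespace per project/domain pair.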
g
Thanks. I saw that quotas can be specified for cpus and memory. Can that be done also for GPUs (count)?
v
I checked how this could be achieved for GPUs, and this section seems relevant: https://docs.flyte.org/en/latest/deployment/configuration/general.html#cluster-resources
Project quotas use the Kubernetes `ResourceQuota` resource, which is templated here: https://github.com/flyteorg/flyte/blob/master/charts/flyte-core/values.yaml#L877
According to the cluster-resources section of the configuration guide, you can define custom attributes, which can then be passed to a custom template specified in values.yaml (assuming a Helm installation)
j
Yes I believe so. Anything that k8s supports in its ResourceQuota object. https://kubernetes.io/docs/concepts/policy/resource-quotas/
v
So in your case that would be something like

```yaml
- key: ab_project_resource_quota
  value: |
    apiVersion: v1
    kind: ResourceQuota
    metadata:
      name: project-quota
      namespace: {{ namespace }}
    spec:
      hard:
        limits.cpu: {{ projectQuotaCpu }}
        limits.memory: {{ projectQuotaMemory }}
        limits.nvidia.com/gpu: {{ projectQuotaGpu }} # this is the added line, the rest is from the default values
```
and
```yaml
attributes:
  projectQuotaCpu: "1000"
  projectQuotaMemory: 5Ti
  projectQuotaGpu: "100" # this is the added custom attribute
domain: development
project: flyteexamples
```
with
```shell
flytectl update cluster-resource-attribute --attrFile cra.yaml
```
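Once applied, it might be worth sanity-checking both sides (a sketch; the `flyteexamples-development` namespace assumes Flyte's default `<project>-<domain>` namespace template):

```shell
# Confirm flyteadmin stored the attributes
flytectl get cluster-resource-attribute -p flyteexamples -d development
# Confirm the templated ResourceQuota landed in the target cluster
kubectl get resourcequota -n flyteexamples-development -o yaml
```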
g
Amazing, thank you all for your help. Just wanted to double check that this assumption is correct:
> So basically the control plane would load balance across the data planes, ensuring that the quotas are respected (based on projects), right?
Additional question about this: I saw that Flyte supports the Yunikorn scheduler, which has support for gang scheduling, hierarchical queues, fair share, etc. If we were to use it, I'm assuming we would just configure it directly in each data-plane cluster, and it would operate separately from the existing project quotas?
j
i suspect that would still respect project quotas - this is enforced by the k8s control plane. where did you hear about flyte support for yunikorn?
g
From here
j
ah ok. if im reading that correctly, that's for the kubeflow training operator. gang scheduling actually makes sense there.
i think there is definitely interest in integrating with yunikorn natively from flyte for queues and fair share, but i don't believe this is available out of the box now.
g
fair, that's ok 🙂. Just wanted to understand what we can expect from the scheduling side of things. Thanks!