#ask-the-community

Ferdinand von den Eichen

02/22/2023, 12:43 PM
How is the role setup meant to work in a multi-cluster environment? Let’s consider the task that was scheduled on our data plane cluster (see screenshot below).
1. The only way we could make it work was to allow the data plane cluster to assume the user role of the control cluster (the role that ends with MOUzPYyR) so that it can download the metadata from that bucket. Is that intended?
2. We also couldn’t really overwrite this role. Even if we set other roles in the UI on the control cluster, the task always used this one.
3. Finally, how restrictive are execution labels really? We set a project + domain to run only on cluster3. But when that cluster is not connected, Flyte seems to schedule the workflows somewhere (randomly?). Can this behaviour be prevented?

Samhita Alla

02/27/2023, 9:14 AM
cc @Yee / @Ketan (kumare3)

Ketan (kumare3)

02/27/2023, 3:42 PM
1. This is needed today, as the metadata bucket is assumed to be shared. But we recommend using EKS OIDC with service accounts instead.
2. Overwrite should work. How did you do this? Only service accounts are supported now; assuming an IAM role is deprecated.
3. Yes. You will have to either mark some projects for some clusters or workflows. cc @katrina for labels - can you share an example to exclude

katrina

02/27/2023, 5:18 PM
Re: 3, if the cluster is not connected, then I believe the intent was to make sure the execution gets scheduled eventually, so we disregard the placement rule since there's no corresponding cluster to manage the execution.

Ketan (kumare3)

02/27/2023, 7:30 PM
@Ferdinand von den Eichen so if you look at the deployment docs - https://docs.flyte.org/en/latest/deployment/deployment/multicluster.html#user-and-control-plane-deployment, clusters can have weights and can be disabled. When you say not connected, I am assuming you mean disabled or removed from the config. If so, Flyte thinks there is no rule and so schedules it randomly.
Again, this is the current setting; feel free to propose a modification.
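To make the behaviour described above concrete, here is a minimal Python sketch of a weighted random selector that drops a placement label when no enabled cluster matches it. It is an illustration only, not the flyteadmin implementation (which is written in Go); `Cluster`, `pick_cluster`, and the `label_to_clusters` mapping are hypothetical names invented for this example.

```python
import random
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Cluster:
    # Hypothetical stand-in for one configured data plane cluster.
    name: str
    weight: float
    enabled: bool = True


def pick_cluster(
    clusters: List[Cluster],
    label_to_clusters: Dict[str, List[str]],
    label: Optional[str],
    rng: random.Random,
) -> Cluster:
    """Weighted random choice among enabled clusters, honoring a label if possible."""
    enabled = [c for c in clusters if c.enabled]
    if not enabled:
        raise RuntimeError("no enabled clusters configured")

    if label is not None:
        wanted = set(label_to_clusters.get(label, []))
        matching = [c for c in enabled if c.name in wanted]
        if matching:
            weights = [c.weight for c in matching]
            return rng.choices(matching, weights=weights, k=1)[0]
        # No enabled cluster matches the label: the rule is dropped and the
        # choice falls through to every enabled cluster.

    weights = [c.weight for c in enabled]
    return rng.choices(enabled, weights=weights, k=1)[0]
```

With cluster3 enabled, a label that maps to cluster3 always selects it. With cluster3 disabled or removed, the same label lands on one of the remaining clusters, which is the surprise reported at the start of this thread.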

Ferdinand von den Eichen

02/28/2023, 7:46 AM
Thanks for 1. and 2., that clarifies it. On 3.: OK, understood. Let me outline our use case:
• We have 20+ AWS accounts that we want to execute runs in via data plane clusters
• A few central ones that are always up, with control plane (think dev stage prod)
• When a workflow gets scheduled to a subaccount we want to publish an SNS event and create the data plane cluster on demand in the subaccount
So as you can see, clusters will not be available at schedule time, so the runs get scheduled randomly on a data plane that has no access to the correct AWS account or data. @katrina / @Ketan (kumare3) can you think of any trick to achieve what we want? Otherwise, is it possible to extend / overwrite the default scheduling rules? Basically we would just need it to do nothing instead of scheduling randomly. For example this section of the docs seems promising….
Follow up: Removing lines 139-141 from the random cluster selector would achieve the behaviour we want, don’t you think? I’m referring to this file: https://github.com/flyteorg/flyteadmin/blob/c09eb7d42da6b8f6fc4e53a70b06d3540f99f426/pkg/executioncluster/impl/random_cluster_selector.go#L139 Far from an ideal solution (since I really, really would prefer to avoid forking the code, but…)
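The "do nothing instead of scheduling randomly" idea can be sketched as a strict candidate filter that refuses to fall back when a label has no available cluster. This is a hypothetical Python illustration of the idea, not the flyteadmin code at the linked lines; `NoMatchingClusterError` and `candidates_strict` are invented names.

```python
from typing import Dict, List, Optional, Set


class NoMatchingClusterError(Exception):
    """Raised when a placement label maps to no currently available cluster."""


def candidates_strict(
    available: Set[str],
    label_to_clusters: Dict[str, List[str]],
    label: Optional[str],
) -> List[str]:
    """Return the clusters an execution may run on, without any random fallback."""
    if label is None:
        # No placement rule: every available cluster is a candidate.
        return sorted(available)

    matching = [
        name for name in label_to_clusters.get(label, []) if name in available
    ]
    if not matching:
        # Strict behaviour: surface the problem instead of picking another cluster.
        raise NoMatchingClusterError(
            f"label {label!r} has no available cluster"
        )
    return matching
```

In this sketch the error reaches whoever asks for the placement, so a caller that creates an execution while the target cluster is offline would see a failure, not a deferred run.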
I tried a few things now, but I think I underestimated the complexity of the lifecycle of applying workflows remotely. Whenever I make major changes to the code I linked above that prevent a cluster from being picked (randomly), the code already fails at pyflyte run time. So maybe it is time to challenge some assumptions (I hope you can forgive me for the ping @katrina @David Espejo (he/him) and @Ketan (kumare3) - you have been so helpful in the past. 🫶).
• I was hoping that a workflow can be applied EVEN when the cluster that shall run it is not available at that time. For example I was able to apply --remote runs when the control plane had 0 data plane clusters connected. Once the data plane cluster came online, runs would get scheduled. I saw that there was a concept for retries, which gave me hope that this would work 😅
• More directly phrased, can one of you explain (or point me towards docs) whether there is some kind of persistence layer for runs before they get scheduled to data plane clusters / propeller?

Ketan (kumare3)

03/02/2023, 3:08 PM
There is no such persistence layer in Flyte, as it would add additional complexity.