# flyte-deployment
v
Hi 🙂 Has anyone successfully enabled multi-cluster support after doing a single-cluster deployment? Did you have any downtime or any surprising migration work to do? I'm interested in any learnings there, as we're trying to decide if doing a multi-cluster deployment right away could save us from some pain down the line.
It seems like doing so would just mean moving where the `flyteadmin` and `flytepropeller` pods are running, which shouldn't change much
k
I do not think there will be user-facing downtime; it might be that registration is unavailable for some time
j
We are successfully running a Flyte multi-cluster setup on AWS. We did not experience any surprising blockers while setting it up, but if I remember correctly there are some little configurations here and there that are not mentioned in the docs. Happy to help you out if you want to go for multi-cluster as well 🙂
v
Thanks @Jan Fiedler! I might start doing it today or tomorrow (on Azure). I'll let you know!
j
Oh, exciting! I also plan to launch Flyte on Azure soon - let's stay in touch 😉
k
There are many other folks running on Azure - but recently @Gopal Vashishtha found some problems - those may be specific to sovereign clouds
v
What were the issues?
g
The base URL of the blob store was hardcoded to core.windows.net, which works for public Azure (commercial) but not for sovereign clouds like US Gov
Will likely not affect you
t
I'm currently trying to get the multi-cluster deployment working (I'm a colleague of @Victor Delépine). I'm working from https://docs.flyte.org/en/latest/deployment/deployment/multicluster.html#id2 but I'm having some issues. I managed to solve a couple of problems with the `sync-cluster-resources` init container of the `flyteadmin` pod: it kept failing to connect to my data-plane-only cluster, but that is working now. Now the main flyteadmin container is failing with
```
caught panic: entries is empty [goroutine 1 [running]:
runtime/debug.Stack()
        /usr/local/go/src/runtime/debug/stack.go:24 +0x65
github.com/flyteorg/flyteadmin/pkg/rpc/adminservice.NewAdminServer.func1()
        /go/src/github.com/flyteorg/flyteadmin/pkg/rpc/adminservice/base.go:75 +0x88
panic({0x237d3c0, 0xc000b658c0})
        /usr/local/go/src/runtime/panic.go:884 +0x212
github.com/flyteorg/flyteadmin/pkg/executioncluster/impl.GetExecutionCluster({0x2d78ac8?, 0xc0006062b0?}, {0x0, 0x0}, {0x0, 0x0}, {0x2d6af08, 0xc00034c0a0}, {0x2d6f108, 0xc001686600})
        /go/src/github.com/flyteorg/flyteadmin/pkg/executioncluster/impl/factory.go:28 +0x159
github.com/flyteorg/flyteadmin/pkg/rpc/adminservice.NewAdminServer({0x2d5efd0?, 0xc000114000}, 0xc0005a9950, {0x2d6af08, 0xc00034c0a0}, {0x0, 0x0}, {0x0, 0x0}, 0xc000593b00, ...)
        /go/src/github.com/flyteorg/flyteadmin/pkg/rpc/adminservice/base.go:89 +0x376
github.com/flyteorg/flyteadmin/pkg/server.newGRPCServer({0x2d5efd0, 0xc000114000}, 0xc0005a9950, 0xc00019c000, 0x0?, {0x0?, 0x0}, {0x2d78ac8, 0xc000ec1f00}, {0x0, ...})
        /go/src/github.com/flyteorg/flyteadmin/pkg/server/service.go:116 +0x6d9
github.com/flyteorg/flyteadmin/pkg/server.serveGatewayInsecure({0x2d5efd0?, 0xc000114000}, 0xc000ec1e10?, 0xc00019c000, 0xc0001c4680, 0x7fe3271ce108?, 0x9?, {0x2d78ac8, 0xc000ec1f00})
        /go/src/github.com/flyteorg/flyteadmin/pkg/server/service.go:319 +0x705
github.com/flyteorg/flyteadmin/pkg/server.Serve({0x2d5efd0, 0xc000114000}, 0x4?, 0x4?)
        /go/src/github.com/flyteorg/flyteadmin/pkg/server/service.go:59 +0x19f
github.com/flyteorg/flyteadmin/cmd/entrypoints.glob..func7(0x41e0f20?, {0x27d0389?, 0x2?, 0x2?})
        /go/src/github.com/flyteorg/flyteadmin/cmd/entrypoints/serve.go:39 +0x128
github.com/spf13/cobra.(*Command).execute(0x41e0f20, {0xc00058d5c0, 0x2, 0x2})
        /go/pkg/mod/github.com/spf13/cobra@v1.4.0/command.go:856 +0x67c
github.com/spf13/cobra.(*Command).ExecuteC(0x41e20a0)
        /go/pkg/mod/github.com/spf13/cobra@v1.4.0/command.go:974 +0x3bd
github.com/spf13/cobra.(*Command).Execute(...)
        /go/pkg/mod/github.com/spf13/cobra@v1.4.0/command.go:902
github.com/flyteorg/flyteadmin/cmd/entrypoints.Execute(0x60?)
        /go/src/github.com/flyteorg/flyteadmin/cmd/entrypoints/root.go:50 +0x3a
main.main()
        /go/src/github.com/flyteorg/flyteadmin/cmd/main.go:11 +0x85
]","ts":"2023-08-31T10:08:35Z
```
Has anyone seen this before in multi-cluster setups? I'm using the latest helm release (1.9.1) and flyteadmin 1.1.123.
Ok, this one was my failure to read the instructions properly. Unsurprisingly, setting `enabled: true` is important. It might be nice to add this to the example helm values so that others don't make the same mistake as me.
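For reference, a minimal sketch of what the flag looks like in the helm values (the `dataplane_1` name, endpoint, and credential paths are placeholders for illustration, not taken from this thread):

```yaml
configmap:
  clusters:
    clusterConfigs:
      - name: dataplane_1                           # placeholder cluster name
        endpoint: https://<DATAPLANE-K8S-API-ENDPOINT>:443
        enabled: true                               # the easy-to-miss flag
        auth:
          type: file_path
          tokenPath: /var/run/credentials/dataplane_1_token
          certPath: /var/run/credentials/dataplane_1_cacert
```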
Hello, I managed to get a multi-cluster deployment working, but there were a few things that weren't documented, and I had particular difficulty with the `cluster_resource_manager` (both the init container on flyteadmin and the separate deployment). The multi-cluster setup requires mounting a couple of secrets for authenticating to the kube API of the other k8s clusters. The docs explain using `additionalVolumes` and `additionalVolumeMounts`: https://docs.flyte.org/en/latest/deployment/deployment/multicluster.html#user-and-control-plane-deployment. However, these configs don't affect the resource sync deployment or the resource sync init container, so I got lots of errors about failing to find secrets on those components. Am I right in thinking that the `cluster_resource_manager` is not supported on multi-cluster deployments? As far as I can tell, the `cluster_resource_manager` is just responsible for creating k8s namespaces and applying namespace resource quotas. If that's true then personally I'm happy managing that myself in terraform.
k
Cc @Giacomo Dabisias IT / @Jan Fiedler and a few other folks - you are all running multi-cluster; it would be great to add your learnings somewhere
d
I'm building a multicluster environment and capturing learnings to update the docs. Giacomo is contributing to it as well. Multiple rough edges, it's true
t
To summarise my findings that perhaps should be documented (unless I'm missing something):
1. Probably recommend disabling `cluster_resource_manager` due to secret mounting issues.
2. I think the service account bearer token and the `ca.crt` it refers to are not created automatically since k8s v1.22 (https://kubernetes.io/docs/reference/access-authn-authz/service-accounts-admin/#manual-secret-management-for-serviceaccounts). I had to create this manually.
3. You need to configure the flyteadmin endpoint on the data planes so they can communicate back to flyteadmin on the control plane. I did this with `configmap.admin.admin.endpoint` etc.
4. `enabled: true` is missing from `configmap.clusters.clusterConfigs` in the example at https://github.com/flyteorg/flyte/blob/44f5f42b3e5a75747932a6d76e8fc8fef625f3d2/charts/flyte-core/values.yaml.
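For point 2, a sketch of the manually created token Secret, following the approach in the linked k8s docs (the `flyteadmin` service account name and `flyte` namespace are assumptions; adjust to your setup):

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: flyteadmin-token            # assumed name
  namespace: flyte                  # assumed namespace
  annotations:
    kubernetes.io/service-account.name: flyteadmin
type: kubernetes.io/service-account-token
# Kubernetes populates the token and ca.crt fields once this is applied
```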
v
I'd be curious to hear more about Tom's question here:
> Am I right in thinking that `cluster_resource_manager` is not supported on multi-cluster deployments? As far as I can tell, the `cluster_resource_manager` is just responsible for creating k8s namespaces and applying namespace resource quotas. If that's true then personally I'm happy managing that myself in terraform.
We've had to disable the `cluster_resource_manager` and manage namespaces and service accounts manually via terraform for the moment
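If you do manage this yourself, the equivalent of what the `cluster_resource_manager` would create is roughly a namespace plus a resource quota per project-domain. A sketch in raw manifests (the namespace name and limits are made-up examples):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: flyteexamples-development   # <project>-<domain>, made-up example
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: project-quota
  namespace: flyteexamples-development
spec:
  hard:
    limits.cpu: "16"                # example limits only
    limits.memory: 64Gi
```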
k
Yes, the cluster resource manager is optional
j
Okay, let me try to summarise what we have done to also make the `cluster_resource_manager` work:
> The multi-cluster setup requires mounting a couple of secrets for authenticating to the kube API of the other k8s clusters. The docs explain using `additionalVolumes` and `additionalVolumeMounts`.
Yes, that's correct. We also configured `additionalVolumes`, `additionalVolumeMounts` and `initContainerClusterSyncAdditionalVolumeMounts` on the flyteadmin like this:
```yaml
additionalVolumes:
  - name: cluster-credentials
    secret:
      secretName: cluster-credentials
additionalVolumeMounts:
  - name: cluster-credentials
    mountPath: /etc/credentials
initContainerClusterSyncAdditionalVolumeMounts:
  - name: cluster-credentials
    mountPath: /etc/credentials
```
Like Tom already mentioned, this alone does not fix the failing deployments. This is why we adjusted the deployments for flyteadmin and the cluster resource manager. After the helm install of the control plane, creating an empty secret `cluster-credentials` should turn everything healthy. I'll paste the adjusted deployments below ⬇️. Search for `Kineo Change`.
cluster_resource_sync_deployment.yaml, flyteadmin_deployment.yaml
d
Hi! Thanks to the incredible amount of time spent working on a multicluster setup and the great feedback from you all, plus reviews and PRs from @Giacomo Dabisias IT, here's a PR that aims to update the multi-cluster docs: https://github.com/flyteorg/flyte/pull/3994. Live build: https://flyte--3994.org.readthedocs.build/en/3994/deployment/deployment/multicluster.html. Your reviews/comments are appreciated. @Thomas Newton, thanks for your feedback. The only thing I couldn't reproduce was the issue with the `clusterconfigs`. I find that the `clusterResourceManager` and secrets mounting work just fine with multicluster after this PR. Sometimes, though, `flyteadmin` doesn't reload the configmap after `helm upgrade` operations, forcing a rollout restart to make it load the new config. If you find this behavior consistently, please create an issue so we can explore it better.
t
Thanks 🙌. It looks like https://github.com/flyteorg/flyte/pull/3993 addresses precisely the problem I was having with the cluster_resource_manager