# flyte-deployment
v
Hi 🙂 Has anyone successfully enabled multi-cluster support after doing a single-cluster deployment? Did you have any downtime or any surprising migration work to do? I'm interested in any learnings there, as we're trying to decide if doing a multi-cluster deployment right away could save us from some pain down the line.
It seems like doing so would just mean moving where the `flyteadmin` and `flytepropeller` pods are running, which shouldn't change much
k
I do not think there will be user-facing downtime; it might be that registration is unavailable for some time
j
We are successfully running a Flyte multi-cluster setup on AWS. We did not experience any surprising blockers while setting it up, but if I remember correctly there are some little configurations here and there that are not mentioned in the docs. Happy to help you out if you want to go for multi-cluster as well 🙂
v
Thanks @Jan Fiedler! I might start doing it today or tomorrow (on Azure). I'll let you know!
j
Oh, exciting! I also plan to launch Flyte on Azure soon - let's stay in touch 😉
k
There are many other folks running on Azure - but recently @Gopal Vashishtha found some problems - those may be specific to sovereign clouds
v
What were the issues?
g
The base URL of the blob store was hardcoded to core.windows.net, which works for public Azure (commercial) but not for sovereign clouds like US Gov
Will likely not affect you
t
I'm currently trying to get the multi-cluster deployment working (I'm a colleague of @Victor Delépine). I'm working from https://docs.flyte.org/en/latest/deployment/deployment/multicluster.html#id2 but I'm having some issues. I managed to solve a couple of problems with the `sync-cluster-resources` init container of the `flyteadmin` pod: it kept failing to connect to my data-plane-only cluster, but that is working now. Now the main flyteadmin container is failing with
```
caught panic: entries is empty [goroutine 1 [running]:
runtime/debug.Stack()
        /usr/local/go/src/runtime/debug/stack.go:24 +0x65
github.com/flyteorg/flyteadmin/pkg/rpc/adminservice.NewAdminServer.func1()
        /go/src/github.com/flyteorg/flyteadmin/pkg/rpc/adminservice/base.go:75 +0x88
panic({0x237d3c0, 0xc000b658c0})
        /usr/local/go/src/runtime/panic.go:884 +0x212
github.com/flyteorg/flyteadmin/pkg/executioncluster/impl.GetExecutionCluster({0x2d78ac8?, 0xc0006062b0?}, {0x0, 0x0}, {0x0, 0x0}, {0x2d6af08, 0xc00034c0a0}, {0x2d6f108, 0xc001686600})
        /go/src/github.com/flyteorg/flyteadmin/pkg/executioncluster/impl/factory.go:28 +0x159
github.com/flyteorg/flyteadmin/pkg/rpc/adminservice.NewAdminServer({0x2d5efd0?, 0xc000114000}, 0xc0005a9950, {0x2d6af08, 0xc00034c0a0}, {0x0, 0x0}, {0x0, 0x0}, 0xc000593b00, ...)
        /go/src/github.com/flyteorg/flyteadmin/pkg/rpc/adminservice/base.go:89 +0x376
github.com/flyteorg/flyteadmin/pkg/server.newGRPCServer({0x2d5efd0, 0xc000114000}, 0xc0005a9950, 0xc00019c000, 0x0?, {0x0?, 0x0}, {0x2d78ac8, 0xc000ec1f00}, {0x0, ...})
        /go/src/github.com/flyteorg/flyteadmin/pkg/server/service.go:116 +0x6d9
github.com/flyteorg/flyteadmin/pkg/server.serveGatewayInsecure({0x2d5efd0?, 0xc000114000}, 0xc000ec1e10?, 0xc00019c000, 0xc0001c4680, 0x7fe3271ce108?, 0x9?, {0x2d78ac8, 0xc000ec1f00})
        /go/src/github.com/flyteorg/flyteadmin/pkg/server/service.go:319 +0x705
github.com/flyteorg/flyteadmin/pkg/server.Serve({0x2d5efd0, 0xc000114000}, 0x4?, 0x4?)
        /go/src/github.com/flyteorg/flyteadmin/pkg/server/service.go:59 +0x19f
github.com/flyteorg/flyteadmin/cmd/entrypoints.glob..func7(0x41e0f20?, {0x27d0389?, 0x2?, 0x2?})
        /go/src/github.com/flyteorg/flyteadmin/cmd/entrypoints/serve.go:39 +0x128
github.com/spf13/cobra.(*Command).execute(0x41e0f20, {0xc00058d5c0, 0x2, 0x2})
        /go/pkg/mod/github.com/spf13/cobra@v1.4.0/command.go:856 +0x67c
github.com/spf13/cobra.(*Command).ExecuteC(0x41e20a0)
        /go/pkg/mod/github.com/spf13/cobra@v1.4.0/command.go:974 +0x3bd
github.com/spf13/cobra.(*Command).Execute(...)
        /go/pkg/mod/github.com/spf13/cobra@v1.4.0/command.go:902
github.com/flyteorg/flyteadmin/cmd/entrypoints.Execute(0x60?)
        /go/src/github.com/flyteorg/flyteadmin/cmd/entrypoints/root.go:50 +0x3a
main.main()
        /go/src/github.com/flyteorg/flyteadmin/cmd/main.go:11 +0x85
]","ts":"2023-08-31T10:08:35Z
```
Has anyone seen this before in multi-cluster setups? I'm using the latest helm release (1.9.1) and flyteadmin 1.1.123.
Ok, this one was my failure to read the instructions properly. Unsurprisingly, setting `enabled: true` is important. It might be nice to add this to the example helm values so that others don't make the same mistake as me.
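For reference, a minimal sketch of what the flag looks like in the helm values (the `dataplane_1` name, endpoint, and credential paths are placeholders for illustration, not taken from this thread):

```yaml
configmap:
  clusters:
    clusterConfigs:
      - name: dataplane_1                           # placeholder cluster name
        endpoint: https://<DATAPLANE-K8S-API-ENDPOINT>:443
        enabled: true                               # the easy-to-miss flag
        auth:
          type: file_path
          tokenPath: /var/run/credentials/dataplane_1_token
          certPath: /var/run/credentials/dataplane_1_cacert
```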
Hello, I managed to get a multi-cluster deployment working, but there were a few things that weren't documented, and I had particular difficulty with the `cluster_resource_manager` (both the init container on flyteadmin and the separate deployment). The multi-cluster setup requires mounting a couple of secrets for authenticating to the kube API of the other k8s clusters. The docs explain using `additionalVolumes` and `additionalVolumeMounts`: https://docs.flyte.org/en/latest/deployment/deployment/multicluster.html#user-and-control-plane-deployment. However, these configs don't affect the resource sync deployment or the resource sync init container, so I got lots of errors about failing to find secrets on those components. Am I right in thinking that the `cluster_resource_manager` is not supported on multi-cluster deployments? As far as I can tell, the `cluster_resource_manager` is just responsible for creating k8s namespaces and applying namespace resource quotas. If that's true then personally I'm happy managing that myself in terraform.
k
Cc @Giacomo Dabisias IT / @Jan Fiedler and a few other folks - you are all running multi-cluster; it would be great to add your learnings somewhere
d
I'm building a multicluster environment and capturing learnings to update the docs. Giacomo is contributing to it as well. Multiple rough edges, it's true
t
To summarise my findings that perhaps should be documented (unless I'm missing something):
1. Probably recommend disabling `cluster_resource_manager` due to secret mounting issues.
2. I think the service account bearer token and the `ca.crt` it refers to are not created automatically since k8s v1.22 (https://kubernetes.io/docs/reference/access-authn-authz/service-accounts-admin/#manual-secret-management-for-serviceaccounts). I had to create this manually.
3. You need to configure the flyteadmin endpoint on the data planes so they can communicate back to flyteadmin on the control plane. I did this with `configmap.admin.admin.endpoint` etc.
4. `enabled: true` is missing from `configmap.clusters.clusterConfigs` in the example at https://github.com/flyteorg/flyte/blob/44f5f42b3e5a75747932a6d76e8fc8fef625f3d2/charts/flyte-core/values.yaml.
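For point 2, a sketch of the manually created token Secret, following the approach in the linked k8s docs (the `flyteadmin` service account name and `flyte` namespace are assumptions; adjust to your setup):

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: flyteadmin-token            # assumed name
  namespace: flyte                  # assumed namespace
  annotations:
    kubernetes.io/service-account.name: flyteadmin
type: kubernetes.io/service-account-token
# Kubernetes populates the token and ca.crt fields once this is applied
```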
v
I'd be curious to hear more about Tom's question here:
> Am I right in thinking that `cluster_resource_manager` is not supported on multi-cluster deployments? As far as I can tell, the `cluster_resource_manager` is just responsible for creating k8s namespaces and applying namespace resource quotas. If that's true then personally I'm happy managing that myself in terraform.
We've had to disable the `cluster_resource_manager` and manage namespaces and service accounts manually via terraform for the moment
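If you do manage this yourself, the equivalent of what the `cluster_resource_manager` would create is roughly a namespace plus a resource quota per project-domain. A sketch in raw manifests (the namespace name and limits are made-up examples):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: flyteexamples-development   # <project>-<domain>, made-up example
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: project-quota
  namespace: flyteexamples-development
spec:
  hard:
    limits.cpu: "16"                # example limits only
    limits.memory: 64Gi
```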
k
Yes, the cluster resource manager is optional
j
Okay, let me try to summarise what we have done to also make the `cluster_resource_manager` work:
> The multi-cluster setup requires mounting a couple of secrets for authenticating to the kube API of the other k8s clusters. The docs explain using `additionalVolumes` and `additionalVolumeMounts`.
Yes, that's correct. We also configured `additionalVolumes`, `additionalVolumeMounts` and `initContainerClusterSyncAdditionalVolumeMounts` on the flyteadmin like this:
```yaml
additionalVolumes:
  - name: cluster-credentials
    secret:
      secretName: cluster-credentials
additionalVolumeMounts:
  - name: cluster-credentials
    mountPath: /etc/credentials
initContainerClusterSyncAdditionalVolumeMounts:
  - name: cluster-credentials
    mountPath: /etc/credentials
```
Like Tom already mentioned, this alone does not fix the failing deployments. This is why we adjusted the deployments for flyteadmin and the cluster resource manager. After the helm install of the control plane, creating an empty secret `cluster-credentials` should turn everything healthy. I'll paste the adjusted deployments below ⬇️. Search for `Kineo Change`.
cluster_resource_sync_deployment.yaml, flyteadmin_deployment.yaml
d
Hi! Thanks to the incredible amount of time spent working on a multicluster setup and the great feedback from you all, plus reviews and PRs from @Giacomo Dabisias IT, here's a PR that aims to update the multi-cluster docs: https://github.com/flyteorg/flyte/pull/3994. Live build: https://flyte--3994.org.readthedocs.build/en/3994/deployment/deployment/multicluster.html. Your reviews/comments are appreciated. @Thomas Newton, thanks for your feedback. The only thing I couldn't reproduce was the issue with the `clusterconfigs`. I find that the `clusterResourceManager` and secrets mounting work just fine with multicluster after this PR. Sometimes, though, `flyteadmin` doesn't reload the configmap after `helm upgrade` operations, forcing a rollout restart to make it load the new config. If you find this behavior consistently, please create an issue so we can explore it better.
t
Thanks 🙌. It looks like https://github.com/flyteorg/flyte/pull/3993 addresses precisely the problem I was having with the cluster_resource_manager