gorgeous-beach-23305
05/29/2023, 5:52 PM
This is in the inline section of the configuration in values.yaml:
configuration:
  inline:
    configmap:
      enabled_plugins:
        # -- Task specific configuration [structure](https://pkg.go.dev/github.com/flyteorg/flytepropeller/pkg/controller/nodes/task/config#GetConfig)
        tasks:
          # -- Plugins configuration, [structure](https://pkg.go.dev/github.com/flyteorg/flytepropeller/pkg/controller/nodes/task/config#TaskPluginConfig)
          task-plugins:
            # -- [Enabled Plugins](https://pkg.go.dev/github.com/flyteorg/flyteplugins/go/tasks/config#Config). Enable SageMaker*, Athena if you install the backend
            # plugins
            enabled-plugins:
              - container
              - sidecar
              - k8s-array
              - ray
            default-for-task-types:
              container: container
              sidecar: sidecar
              container_array: k8s-array
              ray: ray
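For reference, this plugin config is what routes a Ray task to the cluster. A task along the lines of the example in the Flyte Ray plugin docs looks roughly like the sketch below; the function name, worker group name, and replica count are illustrative, not taken from this setup:

import typing

import ray
from flytekit import task, workflow
from flytekitplugins.ray import HeadNodeConfig, RayJobConfig, WorkerNodeConfig


@ray.remote
def square(x: int) -> int:
    # Executes as a Ray remote function on the cluster the plugin brings up.
    return x * x


# Head/worker settings mirror the documented example; adjust to your cluster.
ray_config = RayJobConfig(
    head_node_config=HeadNodeConfig(ray_start_params={"log-color": "True"}),
    worker_node_config=[WorkerNodeConfig(group_name="ray-group", replicas=2)],
)


@task(task_config=ray_config)
def ray_task(n: int) -> typing.List[int]:
    # Fans work out to Ray workers and gathers the results.
    futures = [square.remote(i) for i in range(n)]
    return ray.get(futures)


@workflow
def ray_wf(n: int = 5) -> typing.List[int]:
    return ray_task(n=n)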
I have all the ray pods running -
NAME READY STATUS RESTARTS AGE
flyte-flyte-binary-6cfdcfc575-9l42x 1/1 Running 0 3d2h
flyte-ray-cluster-kuberay-head-9q6jq 1/1 Running 0 147m
flyte-ray-cluster-kuberay-worker-workergroup-bts8b 1/1 Running 0 147m
kuberay-apiserver-d7bbb9864-htsw4 1/1 Running 0 97m
kuberay-operator-55c84695b8-vftmn 1/1 Running 0 11h
And also all the services -
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
flyte-flyte-binary-grpc ClusterIP x.x.x.x. <none> 8089/TCP 3d3h
flyte-flyte-binary-http ClusterIP x.x.x.x. <none> 8088/TCP 3d3h
flyte-flyte-binary-webhook ClusterIP x.x.x.x. <none> 443/TCP 3d3h
flyte-ray-cluster-kuberay-head-svc ClusterIP x.x.x.x. <none> 10001/TCP,6379/TCP,8265/TCP,8080/TCP,8000/TCP 166m
kuberay-apiserver-service NodePort x.x.x.x. <none> 8888:31888/TCP,8887:31887/TCP 116m
kuberay-operator ClusterIP x.x.x.x. <none> 8080/TCP 3d2h
Questions:
1. Have I configured flyte to use ray correctly using the configmap in values.yaml?
2. How do I verify that the ray task that Flyte says was successful was indeed run on a ray cluster?
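On question 2, one way to check is to look for the RayJob custom resources that the plugin creates while the task runs. A rough sketch using the kubernetes Python client is below; the namespace is an assumed project-domain namespace (not confirmed in this thread), and the group/version match the *v1alpha1.RayJob type that shows up later in the flyte-binary logs:

from kubernetes import client, config

# Uses the local kubeconfig; run this against the same cluster Flyte runs in.
config.load_kube_config()
api = client.CustomObjectsApi()

# List RayJob objects in the namespace where the execution ran.
rayjobs = api.list_namespaced_custom_object(
    group="ray.io",
    version="v1alpha1",
    namespace="flytesnacks-development",  # assumption: <project>-<domain> namespace
    plural="rayjobs",
)

for job in rayjobs.get("items", []):
    name = job["metadata"]["name"]
    status = job.get("status", {}).get("jobStatus", "<no status>")
    print(name, status)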
gorgeous-beach-23305
05/29/2023, 8:08 PM
I moved the task section under inline like so -
tasks:
  task-plugins:
    enabled-plugins:
      - container
      - sidecar
      - K8S-ARRAY
      - ray
    default-for-task-types:
      - container: container
      - container_array: K8S-ARRAY
      - ray: ray
as shown in the eks-production.yaml, and I am getting the following error in the flyte-binary pod -
E0529 19:46:55.461000 7 reflector.go:138] pkg/mod/k8s.io/client-go@v0.24.1/tools/cache/reflector.go:167: Failed to watch *v1alpha1.RayJob: failed to list *v1alpha1.RayJob: rayjobs.ray.io is forbidden: User "system:serviceaccount:flyte:flyte-svc" cannot list resource "rayjobs" in API group "ray.io" at the cluster scope
W0529 19:47:26.035232 7 reflector.go:324] pkg/mod/k8s.io/client-go@v0.24.1/tools/cache/reflector.go:167: failed to list *v1alpha1.RayJob: rayjobs.ray.io is forbidden: User "system:serviceaccount:flyte:flyte-svc" cannot list resource "rayjobs" in API group "ray.io" at the cluster scope
Do I need to annotate the flyte service with something to enable it to access resource "rayjobs" in API group "ray.io"?
gorgeous-beach-23305
05/30/2023, 3:22 AM
Ok, so maybe I had the config in the wrong place. I added a task section under inline (as in the snippet above).
tall-lock-23197
I don't notice any differences in the configuration. Are you seeing the service account issue after modifying the configuration? What have you modified precisely?
gorgeous-beach-23305
05/30/2023, 7:22 AM
I had added the tasks section inside the configmap that I provided in the inline section of configuration in values.yaml, following the example here. After this I was able to run a ray task with the sample provided on that page; however, I could not find any logs of flyte connecting to ray to launch a cluster etc. anywhere.
I then moved the tasks section directly under inline, as shown in the eks-production.yaml; that is when I am seeing the service account error in the logs.
I am using the ray helm charts to install it. Can you please also confirm whether we need to install the ray-operator, ray-cluster and api-server as well?
glamorous-carpet-83516
05/30/2023, 1:47 PM
gorgeous-beach-23305
05/30/2023, 3:26 PM
glamorous-carpet-83516
05/30/2023, 5:56 PM
tall-exabyte-99685
05/31/2023, 11:09 PM
Adding a ClusterRole and ClusterRoleBinding did the trick to resolve that error (it may have just been the ClusterRole that fixed it, not sure).
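For reference, the access being granted there is cluster-scoped get/list/watch (plus write verbs) on rayjobs in the ray.io API group for the flyte:flyte-svc service account named in the error above. The snippet below is only an illustrative sketch of that grant using the kubernetes Python client, with made-up role and binding names; in practice this is usually written as YAML manifests and applied with kubectl or added to the chart.

from kubernetes import client, config

config.load_kube_config()
rbac = client.RbacAuthorizationV1Api()

# ClusterRole granting access to RayJob objects; the name is made up.
role = client.V1ClusterRole(
    metadata=client.V1ObjectMeta(name="flyte-rayjobs-role"),
    rules=[
        client.V1PolicyRule(
            api_groups=["ray.io"],
            resources=["rayjobs"],
            verbs=["get", "list", "watch", "create", "update", "patch", "delete"],
        )
    ],
)

# Recent kubernetes client releases name the RBAC subject model RbacV1Subject;
# older releases call it V1Subject.
Subject = getattr(client, "RbacV1Subject", None) or client.V1Subject

# Bind the role to the service account from the error message (flyte/flyte-svc).
binding = client.V1ClusterRoleBinding(
    metadata=client.V1ObjectMeta(name="flyte-rayjobs-binding"),
    role_ref=client.V1RoleRef(
        api_group="rbac.authorization.k8s.io",
        kind="ClusterRole",
        name="flyte-rayjobs-role",
    ),
    subjects=[Subject(kind="ServiceAccount", name="flyte-svc", namespace="flyte")],
)

rbac.create_cluster_role(body=role)
rbac.create_cluster_role_binding(body=binding)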