Nandakumar Raghu
05/29/2023, 5:52 PM
inline section of the configuration in values.yaml.
configuration:
  inline:
    configmap:
      enabled_plugins:
        # -- Task specific configuration [structure](https://pkg.go.dev/github.com/flyteorg/flytepropeller/pkg/controller/nodes/task/config#GetConfig)
        tasks:
          # -- Plugins configuration, [structure](https://pkg.go.dev/github.com/flyteorg/flytepropeller/pkg/controller/nodes/task/config#TaskPluginConfig)
          task-plugins:
            # -- [Enabled Plugins](https://pkg.go.dev/github.com/flyteorg/flyteplugins/go/tasks/config#Config). Enable SageMaker*, Athena if you install the backend
            # plugins
            enabled-plugins:
              - container
              - sidecar
              - k8s-array
              - ray
            default-for-task-types:
              container: container
              sidecar: sidecar
              container_array: k8s-array
              ray: ray
I have all the ray pods running -
NAME                                                 READY   STATUS    RESTARTS   AGE
flyte-flyte-binary-6cfdcfc575-9l42x                  1/1     Running   0          3d2h
flyte-ray-cluster-kuberay-head-9q6jq                 1/1     Running   0          147m
flyte-ray-cluster-kuberay-worker-workergroup-bts8b   1/1     Running   0          147m
kuberay-apiserver-d7bbb9864-htsw4                    1/1     Running   0          97m
kuberay-operator-55c84695b8-vftmn                    1/1     Running   0          11h
And also all the services -
NAME                                 TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)                                         AGE
flyte-flyte-binary-grpc              ClusterIP   x.x.x.x.     <none>        8089/TCP                                        3d3h
flyte-flyte-binary-http              ClusterIP   x.x.x.x.     <none>        8088/TCP                                        3d3h
flyte-flyte-binary-webhook           ClusterIP   x.x.x.x.     <none>        443/TCP                                         3d3h
flyte-ray-cluster-kuberay-head-svc   ClusterIP   x.x.x.x.     <none>        10001/TCP,6379/TCP,8265/TCP,8080/TCP,8000/TCP   166m
kuberay-apiserver-service            NodePort    x.x.x.x.     <none>        8888:31888/TCP,8887:31887/TCP                   116m
kuberay-operator                     ClusterIP   x.x.x.x.     <none>        8080/TCP                                        3d2h
Questions:
1. Have I configured flyte to use ray correctly using the configmap in values.yaml?
2. How do I verify that the ray task that Flyte says was successful was indeed run on a ray cluster?
Ok, so maybe I had the config in the wrong place. I added a tasks section under inline, like so -
tasks:
  task-plugins:
    enabled-plugins:
      - container
      - sidecar
      - K8S-ARRAY
      - ray
    default-for-task-types:
      - container: container
      - container_array: K8S-ARRAY
      - ray: ray
as shown in the eks-production.yaml and I am getting the following error in the flyte-binary pod -
E0529 19:46:55.461000 7 reflector.go:138] pkg/mod/k8s.io/client-go@v0.24.1/tools/cache/reflector.go:167: Failed to watch *v1alpha1.RayJob: failed to list *v1alpha1.RayJob: rayjobs.ray.io is forbidden: User "system:serviceaccount:flyte:flyte-svc" cannot list resource "rayjobs" in API group "ray.io" at the cluster scope
W0529 19:47:26.035232 7 reflector.go:324] pkg/mod/k8s.io/client-go@v0.24.1/tools/cache/reflector.go:167: failed to list *v1alpha1.RayJob: rayjobs.ray.io is forbidden: User "system:serviceaccount:flyte:flyte-svc" cannot list resource "rayjobs" in API group "ray.io" at the cluster scope
Do I need to annotate the flyte service with something to enable it to access resource "rayjobs" in API group "ray.io"?
Samhita Alla
I don't notice any differences in the configuration. Are you seeing the service account issue after modifying the configuration? What have you modified precisely?
Nandakumar Raghu
05/30/2023, 7:22 AM
I first added the tasks section inside the configmap, which I had provided in the inline section of configuration in values.yaml, following the example here. After this I was able to run a ray task with the sample provided on that page; however, I could not find any logs of Flyte connecting to Ray to launch a cluster, etc., anywhere.
I then moved the tasks section directly under inline, as shown in the eks-production.yaml; that is when I started seeing the service account error in the logs.
I am using the Ray Helm charts to install it. Can you please also confirm whether we need to install the ray-operator, ray-cluster and api-server as well?
Kevin Su
05/30/2023, 1:47 PM
Nandakumar Raghu
05/30/2023, 3:26 PM
Kevin Su
05/30/2023, 5:56 PM
Peter Klingelhofer
05/31/2023, 11:09 PM
ClusterRole and ClusterRoleBinding did the trick to resolve that error (it may have just been the ClusterRole that fixed it, not sure).
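For anyone landing here with the same "cannot list resource "rayjobs" in API group "ray.io"" error, here is a minimal sketch of what such a ClusterRole and ClusterRoleBinding could look like. It is not the exact manifest used in this thread. The service account name (flyte-svc) and namespace (flyte) are taken from the error message above; the object name flyte-ray-rayjobs is hypothetical, and the verb list beyond list/watch is an assumption about what the plugin needs, so adjust both to your setup.

```yaml
# Sketch only: grants the Flyte service account access to RayJob resources.
# "flyte-ray-rayjobs" is a hypothetical name chosen for illustration.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: flyte-ray-rayjobs
rules:
  - apiGroups: ["ray.io"]
    resources: ["rayjobs"]
    # list and watch are what the reflector in the log was denied;
    # the remaining verbs are an assumption for creating and managing jobs.
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: flyte-ray-rayjobs
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: flyte-ray-rayjobs
subjects:
  # Service account and namespace as they appear in the error message:
  # system:serviceaccount:flyte:flyte-svc
  - kind: ServiceAccount
    name: flyte-svc
    namespace: flyte
```

A new ClusterRole has no effect until something binds it to the service account, which is what the ClusterRoleBinding does here. If the rule is instead added to a ClusterRole that is already bound to flyte-svc, no extra binding should be needed, which may be why the ClusterRole change alone appeared to fix it.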