Hi All, I am working on integrating Ray with Flyte...
# ray-integration
n
Hi All, I am working on integrating Ray with Flyte. I have been able to register and run the Ray task, and it completes successfully. But I am not able to find any logs anywhere saying that the task was run through Ray. Also, I can't see any pods being created or destroyed. There is a Ray cluster created, but it is also not destroyed after the task run. I have installed the Ray operator, Ray cluster and Ray api-server using their Helm charts, and I have added the configmap in the `inline` section of the `configuration` in values.yaml.
configuration:
  inline:
    configmap:
      enabled_plugins:
        # -- Task specific configuration [structure](<https://pkg.go.dev/github.com/flyteorg/flytepropeller/pkg/controller/nodes/task/config#GetConfig>)
        tasks:
          # -- Plugins configuration, [structure](<https://pkg.go.dev/github.com/flyteorg/flytepropeller/pkg/controller/nodes/task/config#TaskPluginConfig>)
          task-plugins:
            # -- [Enabled Plugins](<https://pkg.go.dev/github.com/flyteorg/flyteplugins/go/tasks/config#Config>). Enable SageMaker*, Athena if you install the backend
            # plugins
            enabled-plugins:
              - container
              - sidecar
              - k8s-array
              - ray
            default-for-task-types:
              container: container
              sidecar: sidecar
              container_array: k8s-array
              ray: ray
I have all the ray pods running -
NAME                                                 READY   STATUS    RESTARTS   AGE
flyte-flyte-binary-6cfdcfc575-9l42x                  1/1     Running   0          3d2h
flyte-ray-cluster-kuberay-head-9q6jq                 1/1     Running   0          147m
flyte-ray-cluster-kuberay-worker-workergroup-bts8b   1/1     Running   0          147m
kuberay-apiserver-d7bbb9864-htsw4                    1/1     Running   0          97m
kuberay-operator-55c84695b8-vftmn                    1/1     Running   0          11h
And also all the services -
NAME                                 TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                                         AGE
flyte-flyte-binary-grpc              ClusterIP   x.x.x.x.         <none>        8089/TCP                                        3d3h
flyte-flyte-binary-http              ClusterIP   x.x.x.x.         <none>        8088/TCP                                        3d3h
flyte-flyte-binary-webhook           ClusterIP   x.x.x.x.         <none>        443/TCP                                         3d3h
flyte-ray-cluster-kuberay-head-svc   ClusterIP   x.x.x.x.         <none>        10001/TCP,6379/TCP,8265/TCP,8080/TCP,8000/TCP   166m
kuberay-apiserver-service            NodePort    x.x.x.x.         <none>        8888:31888/TCP,8887:31887/TCP                   116m
kuberay-operator                     ClusterIP   x.x.x.x.         <none>        8080/TCP                                        3d2h
Questions:
1. Have I configured Flyte to use Ray correctly using the configmap in values.yaml?
2. How do I verify that the Ray task that Flyte says was successful was indeed run on a Ray cluster?
OK, so maybe I had the config in the wrong place. I added a `tasks` section under `inline`, like so -
tasks:
  task-plugins:
    enabled-plugins:
      - container
      - sidecar
      - K8S-ARRAY
      - ray
    default-for-task-types:
      - container: container
      - container_array: K8S-ARRAY
      - ray: ray
as shown in the eks-production.yaml and I am getting the following error in the flyte-binary pod -
E0529 19:46:55.461000       7 reflector.go:138] pkg/mod/k8s.io/client-go@v0.24.1/tools/cache/reflector.go:167: Failed to watch *v1alpha1.RayJob: failed to list *v1alpha1.RayJob: rayjobs.ray.io is forbidden: User "system:serviceaccount:flyte:flyte-svc" cannot list resource "rayjobs" in API group "ray.io" at the cluster scope
W0529 19:47:26.035232       7 reflector.go:324] pkg/mod/k8s.io/client-go@v0.24.1/tools/cache/reflector.go:167: failed to list *v1alpha1.RayJob: rayjobs.ray.io is forbidden: User "system:serviceaccount:flyte:flyte-svc" cannot list resource "rayjobs" in API group "ray.io" at the cluster scope
Do I need to annotate the flyte service with something to enable it to access resource "rayjobs" in API group "ray.io"?
@David Espejo (he/him) / @jeev any thoughts?
s
> OK, so maybe I had the config in the wrong place. I added a `tasks` section under `inline`, like so
I don't notice any differences in the configuration. Are you seeing the service account issue after modifying the configuration? What have you modified precisely?
n
So, earlier I had the `tasks` section inside the `configmap`, which I had provided in the `inline` section of `configuration` in values.yaml, following the example here. After this I was able to run a Ray task with the sample provided on that page; however, I could not find any logs of Flyte connecting to Ray to launch a cluster, etc., anywhere. I then moved the `tasks` section directly under `inline`, as shown in the eks-production.yaml; that is when I started seeing the service account error in the logs. I am using the Ray Helm charts to install it. Can you please also confirm whether we need to install the ray-operator, ray-cluster and api-server as well?
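For reference, the second placement described above (the `tasks` section directly under `configuration.inline`) would look roughly like this in values.yaml. This is only a sketch assembled from the snippets earlier in the thread, not a verified chart configuration; the key names and the list style are copied from those snippets as posted.

```yaml
# Sketch only: nesting as described in this thread, with `tasks`
# placed directly under `inline` rather than inside `configmap`.
configuration:
  inline:
    tasks:
      task-plugins:
        enabled-plugins:
          - container
          - sidecar
          - K8S-ARRAY
          - ray
        default-for-task-types:
          - container: container
          - container_array: K8S-ARRAY
          - ray: ray
```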
k
you need to install ray-operator
n
I have installed ray-operator; as you can see in the output of `kubectl get pods`, kuberay-operator-55c84695b8-vftmn is running.
p
I believe @Kevin Su is correct. I was running into this same issue yesterday and adding a `ClusterRole` and `ClusterRoleBinding` did the trick to resolve that error (it may have just been the `ClusterRole` that fixed it, not sure).
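A minimal sketch of what such a `ClusterRole` and `ClusterRoleBinding` might look like, assuming the service account and namespace from the error message above (`flyte-svc` in `flyte`). The name `flyte-ray-access` is made up for illustration, and every verb beyond `list` and `watch` is an assumption about what the Ray plugin needs, so adjust it to your setup.

```yaml
# Hypothetical name "flyte-ray-access"; it does not come from the thread.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: flyte-ray-access
rules:
  - apiGroups: ["ray.io"]      # API group named in the error
    resources: ["rayjobs"]     # resource named in the error
    # The log shows list and watch failing; the other verbs are assumed.
    verbs: ["create", "get", "list", "watch", "update", "patch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: flyte-ray-access
subjects:
  - kind: ServiceAccount
    name: flyte-svc            # from "system:serviceaccount:flyte:flyte-svc"
    namespace: flyte
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: flyte-ray-access
```

After applying it with `kubectl apply -f`, running `kubectl auth can-i list rayjobs.ray.io --as=system:serviceaccount:flyte:flyte-svc --all-namespaces` should show whether the permission took effect.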