# ray-integration



Ketan (kumare3)

06/01/2023, 1:44 PM
Did you deploy the backend plugin and operator for ray

Peter Klingelhofer

06/01/2023, 2:21 PM
Yep, here are the relevant parts of my helmfile.yaml:
```yaml
- name: flyte
  namespace: flyte-{{ .Environment.Name }}
  chart: flyte/flyte-binary
  version: 1.6.0
  inherit:
    - template: releaseDefault

# Ray operator is installed globally on the k8s cluster
- name: ray-operator
  namespace: default
  chart: kuberay/kuberay-operator
  version: 0.5.0

# Create a static ray cluster per environment
- name: ray-cluster
  namespace: ray-cluster-{{ .Environment.Name }}
  chart: kuberay/ray-cluster
  version: 0.5.0
```
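Besides the operator, the flyte-binary deployment itself needs the Ray backend plugin enabled. A sketch of the relevant values, assuming the flyte-binary chart's `configuration.inline` layout (the plugin list shown here is illustrative):

```yaml
# Sketch: flyte-binary Helm values enabling the Ray backend plugin.
# Key layout assumes the chart's `configuration.inline` block.
configuration:
  inline:
    tasks:
      task-plugins:
        enabled-plugins:
          - container
          - sidecar
          - k8s-array
          - ray
        default-for-task-types:
          container: container
          container_array: k8s-array
          ray: ray
```

If `ray` is missing from `enabled-plugins` (or not mapped under `default-for-task-types`), propeller falls back to running Ray tasks as plain containers, which can also produce silently stalled executions.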
Strangely, other workflows that involve ray do get past this point and run, but they keep running perpetually and there are no logs, even when I add the IP address of the ray cluster to the `ray_config`:
```python
ray_config = RayJobConfig(
    worker_node_config=[WorkerNodeConfig(group_name="batt-temp-trend", replicas=2)],
)
```
Checking the logs for the ray-cluster and ray-operator, things seem to be running fine. The ray-operator logs show a lot of this:

```
2023-06-01T15:18:16.710Z        INFO    controllers.RayJob      reconciling RayJob      {"NamespacedName": "project-domain/a86dnj5qj884fnvh8jll-n0-0"}
2023-06-01T15:18:16.710Z        INFO    controllers.RayJob      RayJob associated rayCluster found      {"rayjob": "a86dnj5qj884fnvh8jll-n0-0", "raycluster": "project-domain/a86dnj5qj884fnvh8jll-n0-0-raycluster-mv4vj"}
2023-06-01T15:18:16.711Z        INFO    controllers.RayJob      waiting for the cluster to be ready     {"rayCluster": "a86dnj5qj884fnvh8jll-n0-0-raycluster-mv4vj"}
```

The ray-cluster head node logs show this:

```
2023-05-23 13:26:42,336 INFO -- Usage stats collection is enabled by default without user confirmation because this terminal is detected to be non-interactive. To disable this, add `--disable-usage-stats` to the command that starts the cluster, or run the following command: `ray disable-usage-stats` before starting the cluster. See <> for more details.
2023-05-23 13:26:42,337 INFO -- Local node IP: ###.###.##.###
2023-05-23 13:26:45,116 SUCC -- --------------------
2023-05-23 13:26:45,116 SUCC -- Ray runtime started.
2023-05-23 13:26:45,116 SUCC -- --------------------
2023-05-23 13:26:45,116 INFO -- Next steps
2023-05-23 13:26:45,116 INFO -- To add another node to this Ray cluster, run
2023-05-23 13:26:45,117 INFO --   ray start --address=''
2023-05-23 13:26:45,117 INFO -- To connect to this Ray cluster:
2023-05-23 13:26:45,117 INFO -- import ray
2023-05-23 13:26:45,117 INFO -- ray.init()
2023-05-23 13:26:45,117 INFO -- To submit a Ray job using the Ray Jobs CLI:
2023-05-23 13:26:45,117 INFO --   RAY_ADDRESS='<>' ray job submit --working-dir . -- python
2023-05-23 13:26:45,117 INFO -- See <>
2023-05-23 13:26:45,117 INFO -- for more information on submitting Ray jobs to the Ray cluster.
2023-05-23 13:26:45,117 INFO -- To terminate the Ray runtime, run
2023-05-23 13:26:45,117 INFO --   ray stop
2023-05-23 13:26:45,117 INFO -- To view the status of the cluster, use
2023-05-23 13:26:45,117 INFO --   ray status
2023-05-23 13:26:45,117 INFO -- To monitor and debug Ray, view the dashboard at
2023-05-23 13:26:45,117 INFO --   ###.###.##.###:8265
2023-05-23 13:26:45,117 INFO -- If connection to the dashboard fails, check your firewall settings and network configuration.
2023-05-23 13:26:45,118 INFO -- --block
2023-05-23 13:26:45,118 INFO -- This command will now block forever until terminated by a signal.
2023-05-23 13:26:45,118 INFO -- Running subprocesses are monitored and a message will be printed if any of them terminate unexpectedly. Subprocesses exit with SIGTERM will be treated as graceful, thus NOT reported.
```
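When the operator is stuck on "waiting for the cluster to be ready", inspecting the per-execution RayJob/RayCluster objects usually narrows it down. Illustrative commands (the namespace and object names are taken from the operator log above):

```
# RayJob / RayCluster CRs created by the Flyte plugin for this execution
kubectl -n project-domain get rayjobs
kubectl -n project-domain get rayclusters
kubectl -n project-domain describe raycluster a86dnj5qj884fnvh8jll-n0-0-raycluster-mv4vj

# Pods that never schedule (image pull errors, resource limits) keep the
# cluster from ever reporting ready
kubectl -n project-domain get pods
```

Note that these per-job clusters live in the execution's `project-domain` namespace, not in the static `ray-cluster-*` namespace from the helmfile, so the static cluster's healthy logs don't rule out a problem here.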

Kevin Su

06/01/2023, 6:42 PM
did you build an image for your ray task?

Peter Klingelhofer

06/02/2023, 2:15 PM
Yes, I've been building an image for my ray tasks, pushing it to an AWS repository, and registering the workflows with that image. I just tried building the example image you linked and I'm still getting the same result: jobs queue, reach Running status, and then stall with no logs to indicate why they never complete.