# ray-integration
k
Did you deploy the Ray backend plugin and operator?
p
Yep, here are the relevant parts of my `helmfile.yaml`:
```yaml
releases:
- name: flyte
  namespace: flyte-{{ .Environment.Name }}
  chart: flyte/flyte-binary
  version: 1.6.0
  inherit:
  - template: releaseDefault

# Ray operator is installed globally on the k8s cluster
- name: ray-operator
  namespace: default
  chart: kuberay/kuberay-operator
  version: 0.5.0
# Create a static ray cluster per environment
- name: ray-cluster
  namespace: ray-cluster-{{ .Environment.Name }}
  chart: kuberay/ray-cluster
  version: 0.5.0
```
Strangely, other workflows that involve Ray do get past this point and run, but they keep running perpetually and there are no logs, even when I add the IP address of the Ray cluster to the `ray_config`:
```python
ray_config = RayJobConfig(
  address="###.###.##.###:6379",
  worker_node_config=[WorkerNodeConfig(group_name="batt-temp-trend", replicas=2)],
)
```
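For context, that config gets attached to the task roughly like this (a trimmed sketch modeled on the Flyte Ray plugin docs; the remote function, resource requests, and workflow here are placeholders, not my real code):
```python
import typing

import ray
from flytekit import Resources, task, workflow
from flytekitplugins.ray import RayJobConfig, WorkerNodeConfig

# Same worker group settings as above; the cluster address is omitted in this sketch.
ray_config = RayJobConfig(
    worker_node_config=[WorkerNodeConfig(group_name="batt-temp-trend", replicas=2)],
)


@ray.remote
def square(x: int) -> int:
    # Placeholder remote function executed on the Ray workers.
    return x * x


@task(task_config=ray_config, requests=Resources(mem="2Gi", cpu="1"))
def ray_task(n: int) -> typing.List[int]:
    # The task body uses the Ray API directly; the plugin takes care of
    # submitting this as a RayJob against the cluster the operator creates.
    futures = [square.remote(i) for i in range(n)]
    return ray.get(futures)


@workflow
def ray_wf(n: int = 5) -> typing.List[int]:
    return ray_task(n=n)
```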
Checking the logs for the `ray-cluster` and `ray-operator`, things seem to be running fine. The `ray-operator` logs show a lot of this:
```
2023-06-01T15:18:16.710Z        INFO    controllers.RayJob      reconciling RayJob      {"NamespacedName": "project-domain/a86dnj5qj884fnvh8jll-n0-0"}
2023-06-01T15:18:16.710Z        INFO    controllers.RayJob      RayJob associated rayCluster found      {"rayjob": "a86dnj5qj884fnvh8jll-n0-0", "raycluster": "project-domain/a86dnj5qj884fnvh8jll-n0-0-raycluster-mv4vj"}
2023-06-01T15:18:16.711Z        INFO    controllers.RayJob      waiting for the cluster to be ready     {"rayCluster": "a86dnj5qj884fnvh8jll-n0-0-raycluster-mv4vj"}
```
The `ray-cluster` logs show this:
```
2023-05-23 13:26:42,336 INFO usage_lib.py:399 -- Usage stats collection is enabled by default without user confirmation because this terminal is detected to be non-interactive. To disable this, add `--disable-usage-stats` to the command that starts the cluster, or run the following command: `ray disable-usage-stats` before starting the cluster. See <https://docs.ray.io/en/master/cluster/usage-stats.html> for more details.
2023-05-23 13:26:42,337 INFO scripts.py:710 -- Local node IP: ###.###.##.###
2023-05-23 13:26:45,116 SUCC scripts.py:747 -- --------------------
2023-05-23 13:26:45,116 SUCC scripts.py:748 -- Ray runtime started.
2023-05-23 13:26:45,116 SUCC scripts.py:749 -- --------------------
2023-05-23 13:26:45,116 INFO scripts.py:751 -- Next steps
2023-05-23 13:26:45,116 INFO scripts.py:754 -- To add another node to this Ray cluster, run
2023-05-23 13:26:45,117 INFO scripts.py:762 --   ray start --address='192.168.34.207:6379'
2023-05-23 13:26:45,117 INFO scripts.py:766 -- To connect to this Ray cluster:
2023-05-23 13:26:45,117 INFO scripts.py:768 -- import ray
2023-05-23 13:26:45,117 INFO scripts.py:776 -- ray.init()
2023-05-23 13:26:45,117 INFO scripts.py:781 -- To submit a Ray job using the Ray Jobs CLI:
2023-05-23 13:26:45,117 INFO scripts.py:788 --   RAY_ADDRESS='<http://192.168.34.207:8265>' ray job submit --working-dir . -- python my_script.py
2023-05-23 13:26:45,117 INFO scripts.py:792 -- See <https://docs.ray.io/en/latest/cluster/running-applications/job-submission/index.html> 
2023-05-23 13:26:45,117 INFO scripts.py:796 -- for more information on submitting Ray jobs to the Ray cluster.
2023-05-23 13:26:45,117 INFO scripts.py:800 -- To terminate the Ray runtime, run
2023-05-23 13:26:45,117 INFO scripts.py:801 --   ray stop
2023-05-23 13:26:45,117 INFO scripts.py:804 -- To view the status of the cluster, use
2023-05-23 13:26:45,117 INFO scripts.py:805 --   ray status
2023-05-23 13:26:45,117 INFO scripts.py:809 -- To monitor and debug Ray, view the dashboard at 
2023-05-23 13:26:45,117 INFO scripts.py:812 --   ###.###.##.###:8265
2023-05-23 13:26:45,117 INFO scripts.py:819 -- If connection to the dashboard fails, check your firewall settings and network configuration.
2023-05-23 13:26:45,118 INFO scripts.py:917 -- --block
2023-05-23 13:26:45,118 INFO scripts.py:919 -- This command will now block forever until terminated by a signal.
2023-05-23 13:26:45,118 INFO scripts.py:922 -- Running subprocesses are monitored and a message will be printed if any of them terminate unexpectedly. Subprocesses exit with SIGTERM will be treated as graceful, thus NOT reported.
```
k
Did you build an image for your Ray task?
p
Yes, I've been building an image for my Ray tasks, pushing it to an AWS repository, and registering the workflows with that image. I just tried to build that example image you linked and I'm still getting the same result: jobs queue and reach Running status, but there are no logs to indicate why they stall and never complete.
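Concretely, the tasks are pointed at that image via `container_image` on the decorator, roughly like this (simplified; the image URI is a placeholder for the redacted AWS repository):
```python
from flytekit import task
from flytekitplugins.ray import RayJobConfig, WorkerNodeConfig

# Placeholder for the real image URI in the AWS repository (redacted).
RAY_TASK_IMAGE = "<aws-account-id>.dkr.ecr.<region>.amazonaws.com/ray-tasks:latest"

ray_config = RayJobConfig(
    worker_node_config=[WorkerNodeConfig(group_name="batt-temp-trend", replicas=2)],
)


# container_image pins the task to the custom image that bundles ray and
# flytekitplugins-ray, rather than the default flytekit image.
@task(task_config=ray_config, container_image=RAY_TASK_IMAGE)
def ray_smoke_task() -> str:
    # Trivial body; the real tasks run the Ray workload shown earlier.
    return "ok"
```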