# ray-integration
k
Did you deploy the Ray backend plugin and operator?
p
Yep, here are the relevant parts of my `helmfile.yaml`:
```yaml
releases:
- name: flyte
  namespace: flyte-{{ .Environment.Name }}
  chart: flyte/flyte-binary
  version: 1.6.0
  inherit:
  - template: releaseDefault

# Ray operator is installed globally on the k8s cluster
- name: ray-operator
  namespace: default
  chart: kuberay/kuberay-operator
  version: 0.5.0
# Create a static ray cluster per environment
- name: ray-cluster
  namespace: ray-cluster-{{ .Environment.Name }}
  chart: kuberay/ray-cluster
  version: 0.5.0
```
Strangely, other workflows that involve Ray do get past this point and run, but they keep running perpetually and there are no logs, even when I add the IP address of the Ray cluster to the `ray_config`:
```python
ray_config = RayJobConfig(
  address="###.###.##.###:6379",
  worker_node_config=[WorkerNodeConfig(group_name="batt-temp-trend", replicas=2)],
)
```
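For context, that config gets attached to the task roughly like this (a trimmed sketch modeled on the Flyte Ray plugin docs; the remote function, resource requests, and workflow here are placeholders, not my real code):
```python
import typing

import ray
from flytekit import Resources, task, workflow
from flytekitplugins.ray import RayJobConfig, WorkerNodeConfig

# Same worker group settings as above; the cluster address is omitted in this sketch.
ray_config = RayJobConfig(
    worker_node_config=[WorkerNodeConfig(group_name="batt-temp-trend", replicas=2)],
)


@ray.remote
def square(x: int) -> int:
    # Placeholder remote function executed on the Ray workers.
    return x * x


@task(task_config=ray_config, requests=Resources(mem="2Gi", cpu="1"))
def ray_task(n: int) -> typing.List[int]:
    # The task body uses the Ray API directly; the plugin takes care of
    # submitting this as a RayJob against the cluster the operator creates.
    futures = [square.remote(i) for i in range(n)]
    return ray.get(futures)


@workflow
def ray_wf(n: int = 5) -> typing.List[int]:
    return ray_task(n=n)
```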
Checking the logs for the `ray-cluster` and `ray-operator`, things seem to be running fine. The `ray-operator` logs show a lot of this:
```
2023-06-01T15:18:16.710Z        INFO    controllers.RayJob      reconciling RayJob      {"NamespacedName": "project-domain/a86dnj5qj884fnvh8jll-n0-0"}
2023-06-01T15:18:16.710Z        INFO    controllers.RayJob      RayJob associated rayCluster found      {"rayjob": "a86dnj5qj884fnvh8jll-n0-0", "raycluster": "project-domain/a86dnj5qj884fnvh8jll-n0-0-raycluster-mv4vj"}
2023-06-01T15:18:16.711Z        INFO    controllers.RayJob      waiting for the cluster to be ready     {"rayCluster": "a86dnj5qj884fnvh8jll-n0-0-raycluster-mv4vj"}
```
The `ray-cluster` logs show this:
```
2023-05-23 13:26:42,336 INFO usage_lib.py:399 -- Usage stats collection is enabled by default without user confirmation because this terminal is detected to be non-interactive. To disable this, add `--disable-usage-stats` to the command that starts the cluster, or run the following command: `ray disable-usage-stats` before starting the cluster. See <https://docs.ray.io/en/master/cluster/usage-stats.html> for more details.
2023-05-23 13:26:42,337 INFO scripts.py:710 -- Local node IP: ###.###.##.###
2023-05-23 13:26:45,116 SUCC scripts.py:747 -- --------------------
2023-05-23 13:26:45,116 SUCC scripts.py:748 -- Ray runtime started.
2023-05-23 13:26:45,116 SUCC scripts.py:749 -- --------------------
2023-05-23 13:26:45,116 INFO scripts.py:751 -- Next steps
2023-05-23 13:26:45,116 INFO scripts.py:754 -- To add another node to this Ray cluster, run
2023-05-23 13:26:45,117 INFO scripts.py:762 --   ray start --address='192.168.34.207:6379'
2023-05-23 13:26:45,117 INFO scripts.py:766 -- To connect to this Ray cluster:
2023-05-23 13:26:45,117 INFO scripts.py:768 -- import ray
2023-05-23 13:26:45,117 INFO scripts.py:776 -- ray.init()
2023-05-23 13:26:45,117 INFO scripts.py:781 -- To submit a Ray job using the Ray Jobs CLI:
2023-05-23 13:26:45,117 INFO scripts.py:788 --   RAY_ADDRESS='<http://192.168.34.207:8265>' ray job submit --working-dir . -- python my_script.py
2023-05-23 13:26:45,117 INFO scripts.py:792 -- See <https://docs.ray.io/en/latest/cluster/running-applications/job-submission/index.html> 
2023-05-23 13:26:45,117 INFO scripts.py:796 -- for more information on submitting Ray jobs to the Ray cluster.
2023-05-23 13:26:45,117 INFO scripts.py:800 -- To terminate the Ray runtime, run
2023-05-23 13:26:45,117 INFO scripts.py:801 --   ray stop
2023-05-23 13:26:45,117 INFO scripts.py:804 -- To view the status of the cluster, use
2023-05-23 13:26:45,117 INFO scripts.py:805 --   ray status
2023-05-23 13:26:45,117 INFO scripts.py:809 -- To monitor and debug Ray, view the dashboard at 
2023-05-23 13:26:45,117 INFO scripts.py:812 --   ###.###.##.###:8265
2023-05-23 13:26:45,117 INFO scripts.py:819 -- If connection to the dashboard fails, check your firewall settings and network configuration.
2023-05-23 13:26:45,118 INFO scripts.py:917 -- --block
2023-05-23 13:26:45,118 INFO scripts.py:919 -- This command will now block forever until terminated by a signal.
2023-05-23 13:26:45,118 INFO scripts.py:922 -- Running subprocesses are monitored and a message will be printed if any of them terminate unexpectedly. Subprocesses exit with SIGTERM will be treated as graceful, thus NOT reported.
```
k
Did you build an image for your Ray task?
p
Yes, I've been building an image for my Ray tasks, pushing it to an AWS repository, and registering the workflows with that image. I just tried to build that example image you linked and I'm still getting the same result: jobs queue and reach Running status, but there are no logs to indicate why they stall and never complete.
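Concretely, the tasks are pointed at that image via `container_image` on the decorator, roughly like this (simplified; the image URI is a placeholder for the redacted AWS repository):
```python
from flytekit import task
from flytekitplugins.ray import RayJobConfig, WorkerNodeConfig

# Placeholder for the real image URI in the AWS repository (redacted).
RAY_TASK_IMAGE = "<aws-account-id>.dkr.ecr.<region>.amazonaws.com/ray-tasks:latest"

ray_config = RayJobConfig(
    worker_node_config=[WorkerNodeConfig(group_name="batt-temp-trend", replicas=2)],
)


# container_image pins the task to the custom image that bundles ray and
# flytekitplugins-ray, rather than the default flytekit image.
@task(task_config=ray_config, container_image=RAY_TASK_IMAGE)
def ray_smoke_task() -> str:
    # Trivial body; the real tasks run the Ray workload shown earlier.
    return "ok"
```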