Padma Priya M
12/22/2022, 1:52 PM
Padma Priya M
01/13/2023, 4:33 AM
When I run
pyflyte --config ~/.flyte/config-remote.yaml run --remote --image <image_name> ray_demo.py wf
I get this issue in the logs and the task stays queued in the console. When the same workflow is executed locally using
pyflyte --config ~/.flyte/config-remote.yaml run --image <image_name> ray_demo.py wf
it works fine.
Padma Priya M
01/16/2023, 5:48 AM
Ruksana Kabealo
01/30/2023, 8:42 PM
Marcin Zieminski
02/23/2023, 8:55 PM
2023-02-23T18:08:48.386Z INFO controllers.RayJob RayJob associated rayCluster found {"rayjob": "a8xqtvnds2sp7bgkn96k-fzvfpg5y-0", "raycluster": "ntropy-development/a8xqtvnds2sp7bgkn96k-fzvfpg5y-0-raycluster-xlg4f"}
2023-02-23T18:08:48.387Z INFO controllers.RayJob waiting for the cluster to be ready {"rayCluster": "a8xqtvnds2sp7bgkn96k-fzvfpg5y-0-raycluster-xlg4f"}
2023-02-23T18:08:51.387Z INFO controllers.RayJob reconciling RayJob {"NamespacedName": "ntropy-development/a8xqtvnds2sp7bgkn96k-fzvfpg5y-0"}
2023-02-23T18:08:51.388Z INFO controllers.RayJob RayJob associated rayCluster found {"rayjob": "a8xqtvnds2sp7bgkn96k-fzvfpg5y-0", "raycluster": "ntropy-development/a8xqtvnds2sp7bgkn96k-fzvfpg5y-0-raycluster-xlg4f"}
2023-02-23T18:08:51.388Z INFO controllers.RayJob waiting for the cluster to be ready {"rayCluster": "a8xqtvnds2sp7bgkn96k-fzvfpg5y-0-raycluster-xlg4f"}
2023-02-23T18:08:54.388Z INFO controllers.RayJob reconciling RayJob {"NamespacedName": "ntropy-development/a8xqtvnds2sp7bgkn96k-fzvfpg5y-0"}
2023-02-23T18:08:54.388Z INFO controllers.RayJob RayJob associated rayCluster found {"rayjob": "a8xqtvnds2sp7bgkn96k-fzvfpg5y-0", "raycluster": "ntropy-development/a8xqtvnds2sp7bgkn96k-fzvfpg5y-0-raycluster-xlg4f"}
2023-02-23T18:08:54.389Z INFO controllers.RayJob waiting for the cluster to be ready {"rayCluster": "a8xqtvnds2sp7bgkn96k-fzvfpg5y-0-raycluster-xlg4f"}
2023-02-23T18:08:57.389Z INFO controllers.RayJob reconciling RayJob {"NamespacedName": "ntropy-development/a8xqtvnds2sp7bgkn96k-fzvfpg5y-0"}
These logs seem to be generated by this piece of code:
https://github.com/ray-project/kuberay/blob/89f5fba8d6f868f9fedde1fbe22a6eccad88ecc1/ray-operator/controllers/ray/rayjob_controller.go#L174
and are unexpected, as the cluster is healthy and I can use it on the side.
I would appreciate any help and advice. Do you think it could be the operator version?
My Flyte deployment is version 1.2.1.
Ray in the cluster is 2.2.0.
flytekitplugins-ray: 1.2.7
Kevin Su
02/23/2023, 9:46 PM
Abdullah Mobeen
03/17/2023, 8:04 PM
Padma Priya M
04/12/2023, 4:25 AM
@task(task_config=ray_config, requests=Resources(mem="2000Mi", cpu="1"), limits=Resources(mem="3000Mi", cpu="2"))
- development:
    - projectQuotaCpu:
        value: "64"
    - projectQuotaMemory:
        value: "150Gi"
value: |
  apiVersion: v1
  kind: ResourceQuota
  metadata:
    name: project-quota
    namespace: {{ namespace }}
  spec:
    hard:
      limits.cpu: {{ projectQuotaCpu }}
      limits.memory: {{ projectQuotaMemory }}
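For sizing against that quota, here is a rough sketch (my assumption, to be verified against your deployment: the KubeRay plugin gives each head and worker pod the task's requests/limits, so the whole Ray cluster consumes a multiple of the per-pod limits):

# Hypothetical helper, not part of flytekit: estimate the quota headroom a
# Ray task needs, assuming every head/worker pod inherits the task's limits.
def quota_needed(cpu_limit: int, mem_limit_gi: int, workers: int) -> dict:
    pods = workers + 1  # worker replicas plus one head pod
    return {
        "limits.cpu": cpu_limit * pods,
        "limits.memory": f"{mem_limit_gi * pods}Gi",
    }

# With the task above (limits: cpu=2, mem=3000Mi, roughly 3Gi) and, say, 2 workers:
print(quota_needed(cpu_limit=2, mem_limit_gi=3, workers=2))
# {'limits.cpu': 6, 'limits.memory': '9Gi'} -- well under the 64-CPU / 150Gi quota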
Peter Klingelhofer
04/12/2023, 9:00 PM
There is a \n after `@workflow` (unsurprising, as Jupyter notebooks typically run in the browser); not sure if that could be causing the problem.
Nandakumar Raghu
05/29/2023, 5:52 PM
I enabled the Ray plugin in the inline section of the configuration in values.yaml:
configuration:
  inline:
    configmap:
      enabled_plugins:
        # -- Task specific configuration [structure](https://pkg.go.dev/github.com/flyteorg/flytepropeller/pkg/controller/nodes/task/config#GetConfig)
        tasks:
          # -- Plugins configuration, [structure](https://pkg.go.dev/github.com/flyteorg/flytepropeller/pkg/controller/nodes/task/config#TaskPluginConfig)
          task-plugins:
            # -- [Enabled Plugins](https://pkg.go.dev/github.com/flyteorg/flyteplugins/go/tasks/config#Config). Enable SageMaker*, Athena if you install the backend
            # plugins
            enabled-plugins:
              - container
              - sidecar
              - k8s-array
              - ray
            default-for-task-types:
              container: container
              sidecar: sidecar
              container_array: k8s-array
              ray: ray
I have all the ray pods running -
NAME READY STATUS RESTARTS AGE
flyte-flyte-binary-6cfdcfc575-9l42x 1/1 Running 0 3d2h
flyte-ray-cluster-kuberay-head-9q6jq 1/1 Running 0 147m
flyte-ray-cluster-kuberay-worker-workergroup-bts8b 1/1 Running 0 147m
kuberay-apiserver-d7bbb9864-htsw4 1/1 Running 0 97m
kuberay-operator-55c84695b8-vftmn 1/1 Running 0 11h
And also all the services -
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
flyte-flyte-binary-grpc ClusterIP x.x.x.x. <none> 8089/TCP 3d3h
flyte-flyte-binary-http ClusterIP x.x.x.x. <none> 8088/TCP 3d3h
flyte-flyte-binary-webhook ClusterIP x.x.x.x. <none> 443/TCP 3d3h
flyte-ray-cluster-kuberay-head-svc ClusterIP x.x.x.x. <none> 10001/TCP,6379/TCP,8265/TCP,8080/TCP,8000/TCP 166m
kuberay-apiserver-service NodePort x.x.x.x. <none> 8888:31888/TCP,8887:31887/TCP 116m
kuberay-operator ClusterIP x.x.x.x. <none> 8080/TCP 3d2h
Questions:
1. Have I configured Flyte to use Ray correctly using the configmap in values.yaml?
2. How do I verify that the Ray task that Flyte says was successful was indeed run on a Ray cluster? (see the sketch below)
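One way to check question 2, as a sketch (this task is hypothetical, not from the thread): log the cluster topology from inside the Ray task. If it really ran on the KubeRay cluster, you should see the head plus worker nodes; a local fallback Ray instance would show a single node.

import ray
from flytekit import task
from flytekitplugins.ray import HeadNodeConfig, RayJobConfig, WorkerNodeConfig

ray_config = RayJobConfig(
    head_node_config=HeadNodeConfig(),
    worker_node_config=[WorkerNodeConfig(group_name="ray-group", replicas=1)],
)

@task(task_config=ray_config)
def where_did_i_run() -> int:
    # ray.nodes() lists every node in the cluster this driver joined,
    # one entry per Ray node with its pod IP and liveness.
    nodes = ray.nodes()
    for node in nodes:
        print(node["NodeManagerAddress"], node["Alive"])
    # Aggregate CPU/memory across all nodes; on a 1-head/1-worker cluster
    # this should exceed what a single pod was given.
    print(ray.cluster_resources())
    return len(nodes)

Peter Klingelhofer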
05/31/2023, 11:02 PM
Slackbot
06/01/2023, 12:41 PM
Martin Bomio
06/22/2023, 5:42 PM
Padma Priya M
06/29/2023, 1:29 PM
Padma Priya M
07/13/2023, 1:27 PM
git clone https://github.com/ray-project/kuberay.git
cd kuberay
kubectl create -k ray-operator/config/default
When I submit the Ray workflow, the RayJob, head, and worker pods get created and are up and running. But the workflow is not getting submitted to the cluster, and the job has been queued in the console for more than an hour.
Script:
import typing
import ray
import time
from flytekit import Resources, task, workflow
from flytekitplugins.ray import HeadNodeConfig, RayJobConfig, WorkerNodeConfig


@ray.remote
def square(x):
    return x * x


ray_config = RayJobConfig(
    head_node_config=HeadNodeConfig(ray_start_params={"log-color": "True"}),
    worker_node_config=[WorkerNodeConfig(group_name="ray-group", replicas=2)],
    runtime_env={"pip": ["numpy", "pandas"]},
)


@task(task_config=ray_config, requests=Resources(mem="2000Mi", cpu="1"), limits=Resources(mem="3000Mi", cpu="2"))
def ray_task(n: int) -> typing.List[int]:
    futures = [square.remote(i) for i in range(n)]
    return ray.get(futures)


@workflow
def ray_workflow(n: int) -> typing.List[int]:
    return ray_task(n=n)


if __name__ == "__main__":
    print(ray_workflow(n=10))
Michael Tinsley
07/18/2023, 8:23 PM
- apiGroups:
  - ray.io
  resources:
  - rayjobs
  verbs:
  - "*"
And the RayJob is in a SUCCEEDED state
apiVersion: ray.io/v1alpha1
kind: RayJob
metadata:
  creationTimestamp: '2023-07-18T20:09:53Z'
  finalizers:
  - ray.io/rayjob-finalizer
  name: f9865b58322e24b91a6d-n0-0
  namespace: flyte-playground-development
  ownerReferences:
  - apiVersion: flyte.lyft.com/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: flyteworkflow
    name: f9865b58322e24b91a6d
    uid: dd5f635e-73b8-4641-b36f-96a47b39ce31
  resourceVersion: '391072831'
  uid: 6d09ea85-24f4-4191-a66c-098ceab3ad27
...
status:
  endTime: '2023-07-18T20:10:13Z'
  jobDeploymentStatus: Running
  jobId: f9865b58322e24b91a6d-n0-0-9jcjv
  jobStatus: SUCCEEDED
There isn't anything in the logs to suggest propeller is having an issue removing it. I guess my question is this: is it Flyte or Ray that is responsible for cleaning up the RayJob/RayCluster?
I'm running Flyte 1.7.0 and ray-operator 1.5.2, which I've seen others say is working for them. Any ideas?
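One knob that may be relevant here, as a hedged sketch (shutdown_after_job_finishes and ttl_seconds_after_finished are RayJobConfig fields in recent flytekitplugins-ray releases; verify they exist in your installed version): with shutdown_after_job_finishes set, it is the KubeRay operator, not Flyte propeller, that tears the RayCluster down once the job completes.

from flytekitplugins.ray import HeadNodeConfig, RayJobConfig, WorkerNodeConfig

ray_config = RayJobConfig(
    head_node_config=HeadNodeConfig(),
    worker_node_config=[WorkerNodeConfig(group_name="ray-group", replicas=2)],
    # Ask KubeRay to delete the RayCluster after the job finishes...
    shutdown_after_job_finishes=True,
    # ...optionally keeping it around briefly for debugging (value illustrative).
    ttl_seconds_after_finished=600,
)

Abdullah Mobeen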
07/24/2023, 8:39 PM
Bosco Raju
08/28/2023, 5:34 PM
Padma Priya M
08/30/2023, 5:01 AM
Failure # 1 (occurred at 2023-08-25_05-10-14)
Traceback (most recent call last):
  File "/tmp/ray/session_2023-08-25_05-08-10_330236_9/runtime_resources/pip/14a3daa1f865c8d76d1b356d563406ae44a38d58/virtualenv/lib/python3.8/site-packages/ray/tune/ray_trial_executor.py", line 934, in get_next_executor_event
    future_result = ray.get(ready_future)
  File "/tmp/ray/session_2023-08-25_05-08-10_330236_9/runtime_resources/pip/14a3daa1f865c8d76d1b356d563406ae44a38d58/virtualenv/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/tmp/ray/session_2023-08-25_05-08-10_330236_9/runtime_resources/pip/14a3daa1f865c8d76d1b356d563406ae44a38d58/virtualenv/lib/python3.8/site-packages/ray/worker.py", line 1833, in get
    raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
  class_name: ImplicitFunc
  actor_id: 7ce680c3be6578ac3b02370c02000000
  pid: 131
  namespace: c2845d95-7689-447a-ab70-b45ab9bb75b8
  ip: 172.22.1.70
The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR_EXIT
Failure # 1 (occurred at 2023-08-24_15-04-28)
Traceback (most recent call last):
  File "/tmp/ray/session_2023-08-24_15-02-51_915721_9/runtime_resources/pip/14a3daa1f865c8d76d1b356d563406ae44a38d58/virtualenv/lib/python3.8/site-packages/ray/tune/ray_trial_executor.py", line 934, in get_next_executor_event
    future_result = ray.get(ready_future)
  File "/tmp/ray/session_2023-08-24_15-02-51_915721_9/runtime_resources/pip/14a3daa1f865c8d76d1b356d563406ae44a38d58/virtualenv/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/tmp/ray/session_2023-08-24_15-02-51_915721_9/runtime_resources/pip/14a3daa1f865c8d76d1b356d563406ae44a38d58/virtualenv/lib/python3.8/site-packages/ray/worker.py", line 1833, in get
    raise value
ray.exceptions.RaySystemError: System error: Failed to unpickle serialized exception
traceback: Traceback (most recent call last):
  File "/tmp/ray/session_2023-08-24_15-02-51_915721_9/runtime_resources/pip/14a3daa1f865c8d76d1b356d563406ae44a38d58/virtualenv/lib/python3.8/site-packages/ray/exceptions.py", line 38, in from_ray_exception
    return pickle.loads(ray_exception.serialized_exception)
  File "/tmp/ray/session_2023-08-24_15-02-51_915721_9/runtime_resources/pip/14a3daa1f865c8d76d1b356d563406ae44a38d58/virtualenv/lib/python3.8/site-packages/mlflow/exceptions.py", line 83, in __init__
    error_code = json.get("error_code", ErrorCode.Name(INTERNAL_ERROR))
AttributeError: 'str' object has no attribute 'get'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/tmp/ray/session_2023-08-24_15-02-51_915721_9/runtime_resources/pip/14a3daa1f865c8d76d1b356d563406ae44a38d58/virtualenv/lib/python3.8/site-packages/ray/serialization.py", line 340, in deserialize_objects
    obj = self._deserialize_object(data, metadata, object_ref)
  File "/tmp/ray/session_2023-08-24_15-02-51_915721_9/runtime_resources/pip/14a3daa1f865c8d76d1b356d563406ae44a38d58/virtualenv/lib/python3.8/site-packages/ray/serialization.py", line 260, in _deserialize_object
    return RayError.from_bytes(obj)
  File "/tmp/ray/session_2023-08-24_15-02-51_915721_9/runtime_resources/pip/14a3daa1f865c8d76d1b356d563406ae44a38d58/virtualenv/lib/python3.8/site-packages/ray/exceptions.py", line 32, in from_bytes
    return RayError.from_ray_exception(ray_exception)
  File "/tmp/ray/session_2023-08-24_15-02-51_915721_9/runtime_resources/pip/14a3daa1f865c8d76d1b356d563406ae44a38d58/virtualenv/lib/python3.8/site-packages/ray/exceptions.py", line 41, in from_ray_exception
    raise RuntimeError(msg) from e
RuntimeError: Failed to unpickle serialized exception
Can anyone suggest some ways to resolve this, and also confirm whether this is an issue on the Ray side? (A sketch of the suspected failure mode follows.)
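Reading the inner traceback, the original error looks like an mlflow exception that cannot survive Ray's pickle round-trip: unpickling re-invokes __init__ with the message string, while mlflow's RestException expects a dict. A standalone sketch of that failure mode (hedged: the RestException signature is inferred from the traceback, and running this requires mlflow installed):

import pickle

# Inferred from the traceback: mlflow/exceptions.py calls json.get("error_code", ...)
# on the first __init__ argument, so it expects a dict.
from mlflow.exceptions import RestException

exc = RestException({"error_code": "INTERNAL_ERROR", "message": "boom"})
# BaseException.__reduce__ re-calls __init__ with str(exc) instead of the
# original dict, so unpickling hits json.get(...) on a string:
pickle.loads(pickle.dumps(exc))  # AttributeError: 'str' object has no attribute 'get'

If that is what is happening, one workaround on the application side is to catch mlflow REST exceptions inside the Ray trial and re-raise them as plain RuntimeError, so the serialized exception can be unpickled on the driver.
Petr Pilař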
08/31/2023, 3:37 PM
Anirudh Sridhar
09/06/2023, 12:50 PM
Flyte starts a Ray dashboard by default that provides cluster metrics and logs across many machines in a single pane as well as Ray memory utilization while debugging memory errors. The dashboard helps Ray users understand Ray clusters and libraries.
But I don't see the Ray dashboard; I just see the Flyte console.
Anirudh Sridhar
09/11/2023, 10:56 AM
https://flyte-org.slack.com/files/U05RR32SN00/F05RNV5KE4D/screenshot_2023-09-11_at_2.11.01_pm.png
Defaulted container "ray-worker" out of: ray-worker, init-myservice (init)
Anirudh Sridhar
09/12/2023, 7:30 AM
import typing
from flytekit import ImageSpec, Resources, task, workflow

custom_image = ImageSpec(
    name="ray-flyte-plugin",
    registry="anirudh1905",
    packages=["flytekitplugins-ray"],
)

if custom_image.is_container():
    import ray
    from flytekitplugins.ray import HeadNodeConfig, RayJobConfig, WorkerNodeConfig


@ray.remote
def f1(x):
    return x * x


@ray.remote
def f2(x):
    return x % 2


ray_config = RayJobConfig(
    head_node_config=HeadNodeConfig(ray_start_params={"log-color": "True"}),
    worker_node_config=[WorkerNodeConfig(group_name="ray-group", replicas=1)],
    runtime_env={"pip": ["numpy", "pandas"]},  # or runtime_env="./requirements.txt"
)


@task(
    cache=True,
    cache_version="0.2",
    task_config=ray_config,
    requests=Resources(mem="2Gi", cpu="1"),
    container_image=custom_image,
)
def ray_task(n: int) -> int:
    futures = [f2.remote(f1.remote(i)) for i in range(n)]
    return sum(ray.get(futures))


@workflow
def ray_workflow(n: int) -> int:
    return ray_task(n=n)
project_config.yaml
domain: development
project: flytesnacks
defaults:
  cpu: "1"
  memory: "2Gi"
limits:
  cpu: "3"
  memory: "8Gi"
I also tried kuberay versions 0.3 and 0.5.2; it is not working with either.
Nandakumar Raghu
09/13/2023, 9:58 AM
Franco Bocci
09/13/2023, 1:41 PM
@dataclass
class WorkerNodeConfig:
    group_name: str
    replicas: int
    min_replicas: typing.Optional[int] = None
    max_replicas: typing.Optional[int] = None
    ray_start_params: typing.Optional[typing.Dict[str, str]] = None
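For reference, a minimal usage sketch of those fields (values are illustrative, not from the thread): replicas is the initial worker count, while min_replicas/max_replicas give KubeRay an autoscaling range for the group.

from flytekitplugins.ray import HeadNodeConfig, RayJobConfig, WorkerNodeConfig

ray_config = RayJobConfig(
    head_node_config=HeadNodeConfig(),
    worker_node_config=[
        WorkerNodeConfig(
            group_name="ray-group",
            replicas=2,      # initial/desired worker count
            min_replicas=1,  # lower autoscaling bound
            max_replicas=5,  # upper autoscaling bound
        )
    ],
)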
Franco Bocci
09/14/2023, 9:59 AM
Started a local Ray instance. View the dashboard at http://127.0.0.1:8265
Kevin Su
09/15/2023, 12:26 AM
Abin Shahab
09/15/2023, 4:52 PM
Padma Priya M
11/06/2023, 8:11 AM
ray_config = RayJobConfig(
    head_node_config=HeadNodeConfig(ray_start_params={"log-color": "True"}),
    worker_node_config=[WorkerNodeConfig(group_name="ray-group", replicas=1)],
    runtime_env={"pip": ["numpy", "pandas"]},
)
For instance: the pod is assigned to a node with IP 172.22.1.123, which has secondary IPs 172.22.1.234 and 172.22.1.456. The trials of the tuning process run on these secondary IPs. The trials running on 172.22.1.234 complete and produce proper results, but the trials running on the other secondary IPs fail with the error in the attached screenshot.
Why are the trials being assigned to the secondary IPs, and why do only the trials on a single IP pass while trials assigned to the other IPs fail with this error?