Hi, I am performing XGBoost tuning with tune.run A...
# ray-integration
Hi, I am tuning XGBoost with the tune.run API in Flyte. More than half of the trials fail because a worker dies unexpectedly partway through. After this error message appears, the worker and head pods are terminated, a new Ray cluster is created, and execution starts again. Why do the workers keep dying in the middle of the process? I see the same issue even with as few as 2 trials: one trial completes successfully and the other fails with the same error message. I cannot figure out why this happens, and many trials fail because of it even though there is no issue with the code.
Failure # 1 (occurred at 2023-08-25_05-10-14)
 Traceback (most recent call last):
 File "/tmp/ray/session_2023-08-25_05-08-10_330236_9/runtime_resources/pip/14a3daa1f865c8d76d1b356d563406ae44a38d58/virtualenv/lib/python3.8/site-packages/ray/tune/ray_trial_executor.py", line 934, in get_next_executor_event
   future_result = ray.get(ready_future)
 File "/tmp/ray/session_2023-08-25_05-08-10_330236_9/runtime_resources/pip/14a3daa1f865c8d76d1b356d563406ae44a38d58/virtualenv/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
   return func(*args, **kwargs)
 File "/tmp/ray/session_2023-08-25_05-08-10_330236_9/runtime_resources/pip/14a3daa1f865c8d76d1b356d563406ae44a38d58/virtualenv/lib/python3.8/site-packages/ray/worker.py", line 1833, in get
   raise value
 ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
       class_name: ImplicitFunc
       actor_id: 7ce680c3be6578ac3b02370c02000000
       pid: 131
       namespace: c2845d95-7689-447a-ab70-b45ab9bb75b8
       ip: 172.22.1.70
 The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR_EXIT
Failure # 1 (occurred at 2023-08-24_15-04-28)
 Traceback (most recent call last):
 File "/tmp/ray/session_2023-08-24_15-02-51_915721_9/runtime_resources/pip/14a3daa1f865c8d76d1b356d563406ae44a38d58/virtualenv/lib/python3.8/site-packages/ray/tune/ray_trial_executor.py", line 934, in get_next_executor_event
   future_result = ray.get(ready_future)
 File "/tmp/ray/session_2023-08-24_15-02-51_915721_9/runtime_resources/pip/14a3daa1f865c8d76d1b356d563406ae44a38d58/virtualenv/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
   return func(*args, **kwargs)
 File "/tmp/ray/session_2023-08-24_15-02-51_915721_9/runtime_resources/pip/14a3daa1f865c8d76d1b356d563406ae44a38d58/virtualenv/lib/python3.8/site-packages/ray/worker.py", line 1833, in get
   raise value
 ray.exceptions.RaySystemError: System error: Failed to unpickle serialized exception
 traceback: Traceback (most recent call last):
 File "/tmp/ray/session_2023-08-24_15-02-51_915721_9/runtime_resources/pip/14a3daa1f865c8d76d1b356d563406ae44a38d58/virtualenv/lib/python3.8/site-packages/ray/exceptions.py", line 38, in from_ray_exception
   return pickle.loads(ray_exception.serialized_exception)
 File "/tmp/ray/session_2023-08-24_15-02-51_915721_9/runtime_resources/pip/14a3daa1f865c8d76d1b356d563406ae44a38d58/virtualenv/lib/python3.8/site-packages/mlflow/exceptions.py", line 83, in __init__
   error_code = json.get("error_code", ErrorCode.Name(INTERNAL_ERROR))
 AttributeError: 'str' object has no attribute 'get'
 
The above exception was the direct cause of the following exception:
 
Traceback (most recent call last):
 File "/tmp/ray/session_2023-08-24_15-02-51_915721_9/runtime_resources/pip/14a3daa1f865c8d76d1b356d563406ae44a38d58/virtualenv/lib/python3.8/site-packages/ray/serialization.py", line 340, in deserialize_objects
   obj = self._deserialize_object(data, metadata, object_ref)
 File "/tmp/ray/session_2023-08-24_15-02-51_915721_9/runtime_resources/pip/14a3daa1f865c8d76d1b356d563406ae44a38d58/virtualenv/lib/python3.8/site-packages/ray/serialization.py", line 260, in _deserialize_object
   return RayError.from_bytes(obj)
 File "/tmp/ray/session_2023-08-24_15-02-51_915721_9/runtime_resources/pip/14a3daa1f865c8d76d1b356d563406ae44a38d58/virtualenv/lib/python3.8/site-packages/ray/exceptions.py", line 32, in from_bytes
   return RayError.from_ray_exception(ray_exception)
 File "/tmp/ray/session_2023-08-24_15-02-51_915721_9/runtime_resources/pip/14a3daa1f865c8d76d1b356d563406ae44a38d58/virtualenv/lib/python3.8/site-packages/ray/exceptions.py", line 41, in from_ray_exception
   raise RuntimeError(msg) from e
 RuntimeError: Failed to unpickle serialized exception
Can anyone suggest ways to resolve this, and confirm whether this is an issue in Ray?
Please contact the Ray community.
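One observation on the second traceback: the frame in mlflow/exceptions.py fails with "'str' object has no attribute 'get'" while Ray is unpickling the exception, which suggests the trial raised an MLflow exception whose constructor expects a dict, so the real error message is lost in transit. A minimal sketch of one possible workaround is below: convert any exception inside the trainable into a plain RuntimeError before it leaves the worker. The names DictOnlyError, picklable_errors and train_fn are hypothetical; DictOnlyError is only a stand-in that reproduces the unpickling failure without needing mlflow or ray installed. This is an assumption about the second failure only and may not explain the SYSTEM_ERROR_EXIT in the first one.

```python
import functools
import pickle
import traceback


class DictOnlyError(Exception):
    """Hypothetical stand-in for an exception whose __init__ expects a dict."""

    def __init__(self, json):
        # Fails with AttributeError if `json` is a str, as in the traceback above.
        self.error_code = json.get("error_code", "INTERNAL_ERROR")
        super().__init__(f"{self.error_code}: {json.get('message', '')}")


def picklable_errors(fn):
    """Re-raise any exception from fn as a RuntimeError that pickles cleanly."""

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        try:
            return fn(*args, **kwargs)
        except Exception as exc:
            # Keep the original type, message and traceback as plain text.
            detail = traceback.format_exc()
            raise RuntimeError(f"{type(exc).__name__}: {exc}\n{detail}") from None

    return wrapper


@picklable_errors
def train_fn(config):
    # Placeholder for the real XGBoost training and MLflow logging code.
    raise DictOnlyError({"error_code": "RESOURCE_DOES_NOT_EXIST",
                         "message": "run not found"})


if __name__ == "__main__":
    # The raw exception pickles, but cannot be rebuilt from its string args.
    raw = pickle.dumps(DictOnlyError({"message": "boom"}))
    try:
        pickle.loads(raw)
    except AttributeError as err:
        print("unpickle failed:", err)

    # The wrapped trainable raises an error that survives a pickle round trip.
    try:
        train_fn({})
    except RuntimeError as err:
        restored = pickle.loads(pickle.dumps(err))
        print(str(restored).splitlines()[0])
```

With a wrapper like this around the function passed to tune.run, the trial would still fail, but the error file should show the underlying MLflow error instead of "Failed to unpickle serialized exception".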