hey team, when i try running my code with ray plug...
# ray-integration
a
hey team, when i try running my code with ray plugin enabled the pod seems to be running file no issues in the log also, but the issue being the code keeps on running and i dont get the output and the code runs fine locally. Can u please look into it. Thanks! Code:
Copy code
import typing

from flytekit import ImageSpec, Resources, task, workflow

custom_image = ImageSpec(
    name="ray-flyte-plugin",
    registry="anirudh1905",
    packages=["flytekitplugins-ray"],
)

if custom_image.is_container():
    import ray
    from flytekitplugins.ray import HeadNodeConfig, RayJobConfig, WorkerNodeConfig

@ray.remote
def f1(x):
    return x * x

@ray.remote
def f2(x):
    return x%2

ray_config = RayJobConfig(
    head_node_config=HeadNodeConfig(ray_start_params={"log-color": "True"}),
    worker_node_config=[WorkerNodeConfig(group_name="ray-group", replicas=1)],
    runtime_env={"pip": ["numpy", "pandas"]},  # or runtime_env="./requirements.txt"
)

@task(cache=True, cache_version="0.2",
    task_config=ray_config,
    requests=Resources(mem="2Gi", cpu="1"),
    container_image=custom_image,
)
def ray_task(n: int) -> int:
    futures = [f2.remote(f1.remote(i)) for i in range(n)]
    return sum(ray.get(futures))


@workflow
def ray_workflow(n: int) -> int:
    return ray_task(n=n)
project_config.yaml
Copy code
domain: development
project: flytesnacks
defaults:
  cpu: "1"
  memory: "2Gi"
limits:
  cpu: "3"
  memory: "8Gi"
I also tried with kuberay version 0.3 and 0.5.2 in both its not working
Steps that i followed: •
flytectl demo start
• Installed kuberay
Copy code
export KUBERAY_VERSION=v0.5.2
kubectl create -k "<http://github.com/ray-project/kuberay/manifests/cluster-scope-resources?ref=${KUBERAY_VERSION}&timeout=90s|github.com/ray-project/kuberay/manifests/cluster-scope-resources?ref=${KUBERAY_VERSION}&timeout=90s>"
kubectl apply -k "<http://github.com/ray-project/kuberay/manifests/base?ref=${KUBERAY_VERSION}&timeout=90s|github.com/ray-project/kuberay/manifests/base?ref=${KUBERAY_VERSION}&timeout=90s>"
flytectl update task-resource-attribute --attrFile project_config.yaml
pyflyte run --remote example_ray.py ray_workflow --n 1
s
With kuberay
v0.6.0
, I'm seeing the following error in the task pod:
Copy code
RuntimeError: Request failed with status code 500: Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/ray/dashboard/modules/job/job_head.py", line 287, in submit_job
    resp = await job_agent_client.submit_job_internal(submit_request)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/ray/dashboard/modules/job/job_head.py", line 73, in submit_job_internal
    async with <http://self._session.post|self._session.post>(
  File "/usr/local/lib/python3.11/site-packages/aiohttp/client.py", line 1141, in __aenter__
    self._resp = await self._coro
                 ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/aiohttp/client.py", line 560, in _request
    await resp.start(conn)
  File "/usr/local/lib/python3.11/site-packages/aiohttp/client_reqrep.py", line 899, in start
    message, payload = await protocol.read()  # type: ignore[union-attr]
                       ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/aiohttp/streams.py", line 616, in read
    await self._waiter
aiohttp.client_exceptions.ServerDisconnectedError: Server disconnected
@Kevin Su
With kuberay
v0.5.2
that I installed via helm, I'm seeing the following error:
Copy code
2023-09-12T13:40:30.840Z	ERROR	controllers.RayJob	failed to submit job	{"error": "SubmitJob fail: Traceback (most recent call last):\n  File \"/usr/local/lib/python3.10/site-packages/aiohttp/connector.py\", line 980, in _wrap_create_connection\n    return await self._loop.create_connection(*args, **kwargs)  # type: ignore[return-value]  # noqa\n  File \"/usr/local/lib/python3.10/asyncio/base_events.py\", line 1076, in create_connection\n    raise exceptions[0]\n  File \"/usr/local/lib/python3.10/asyncio/base_events.py\", line 1060, in create_connection\n    sock = await self._connect_sock(\n  File \"/usr/local/lib/python3.10/asyncio/base_events.py\", line 969, in _connect_sock\n    await self.sock_connect(sock, address)\n  File \"/usr/local/lib/python3.10/asyncio/selector_events.py\", line 501, in sock_connect\n    return await fut\n  File \"/usr/local/lib/python3.10/asyncio/selector_events.py\", line 541, in _sock_connect_cb\n    raise OSError(err, f'Connect call failed {address}')\nConnectionRefusedError: [Errno 111] Connect call failed ('10.42.0.17', 52365)\n\nThe above exception was the direct cause of the following exception:\n\nTraceback (most recent call last):\n  File \"/usr/local/lib/python3.10/site-packages/ray/dashboard/modules/job/job_head.py\", line 287, in submit_job\n    resp = await job_agent_client.submit_job_internal(submit_request)\n  File \"/usr/local/lib/python3.10/site-packages/ray/dashboard/modules/job/job_head.py\", line 73, in submit_job_internal\n    async with <http://self._session.post|self._session.post>(\n  File \"/usr/local/lib/python3.10/site-packages/aiohttp/client.py\", line 1141, in __aenter__\n    self._resp = await self._coro\n  File \"/usr/local/lib/python3.10/site-packages/aiohttp/client.py\", line 536, in _request\n    conn = await self._connector.connect(\n  File \"/usr/local/lib/python3.10/site-packages/aiohttp/connector.py\", line 540, in connect\n    proto = await self._create_connection(req, traces, timeout)\n  File \"/usr/local/lib/python3.10/site-packages/aiohttp/connector.py\", line 901, in _create_connection\n    _, proto = await self._create_direct_connection(req, traces, timeout)\n  File \"/usr/local/lib/python3.10/site-packages/aiohttp/connector.py\", line 1209, in _create_direct_connection\n    raise last_exc\n  File \"/usr/local/lib/python3.10/site-packages/aiohttp/connector.py\", line 1178, in _create_direct_connection\n    transp, proto = await self._wrap_create_connection(\n  File \"/usr/local/lib/python3.10/site-packages/aiohttp/connector.py\", line 988, in _wrap_create_connection\n    raise client_error(req.connection_key, exc) from exc\naiohttp.client_exceptions.ClientConnectorError: Cannot connect to host 10.42.0.17:52365 ssl:default [Connect call failed ('10.42.0.17', 52365)]\n"}
ImageSpec config:
Copy code
custom_image = ImageSpec(
    name="ray-flyte-plugin",
    registry="samhitaalla",
    packages=["flytekitplugins-ray==1.9.1", "ray==2.6.3"],
    base_image="<http://ghcr.io/flyteorg/flytekit:py3.10-1.9.1|ghcr.io/flyteorg/flytekit:py3.10-1.9.1>"
)
k
I create an issue here, Ray seems to have some issues, so it failed to run the task in the sandbox. The same task can be run on the eks cluster.
s
Should we be making any changes to the plugin to fix this?
a
Hey @Kevin Su i need it for the POC
k
Ok, will deep dive into it today
I find the bug. are you using mac M1 or M2?
cc @Anirudh Sridhar
a
M1 Max chip
k
kuberay doesn’t build the image for arm64.
so kuberay operator keeps failing to submit the job.
a
So what should I do to test it locally?
k
I’m building the arm image right now, I can share it with you shortly.
then you can replace current kuberay image with mine
pingsutw/kuberay:v2
you could use
kubeclt edit -n ray-system deploy kuberay-operator
to update the image
Copy code
custom_image = ImageSpec(
    ...
    platform="linux/arm64"
)
pass platform config to image Spec for building the image for m1
a
the pods are created but the flyte console giving this error
Failed to create Ray job: fe6dfd3a66f074481942-n0-0
@Kevin Su
and i saw the logs of the head node everything looks fine
k
are you using kuberay 0.5.2?
did you see any error in kuberay operator?
a
no
k
do push your image to docker hub? if so, could you share the image uri? I can test it on my side
a
ig this is the one
anirudh1905/ray-flyte-plugin
k
what’s its tag
a
anirudh1905/ray-flyte-plugin:MyZdKEQHTVHkHlvw__ZUxQ..
k
there are some issues in flytekit 1.9,1. I just tested 1.9.0, and it works for me.
Copy code
custom_image = ImageSpec(
    ...
    packages=["flytekitplugins-ray", "flytekit==1.9.0"],
    platform="linux/arm64"
)
a
Hey @Kevin Su the code works fine for n=1 and n=10 but fails for 20 and starts giving error insufficient cpu Also 1 more observation when i pass 3 as cpu it starts giving insufficient cpu for n=1 also given that i have given 5 cpu to docker desktop and this is my project_config.yaml
Copy code
domain: development
project: flytesnacks
defaults:
  cpu: "3"
  memory: "2Gi"
limits:
  cpu: "4"
  memory: "8Gi"