Anirudh Sridhar
09/12/2023, 7:30 AM
import typing

from flytekit import ImageSpec, Resources, task, workflow

custom_image = ImageSpec(
    name="ray-flyte-plugin",
    registry="anirudh1905",
    packages=["flytekitplugins-ray"],
)

if custom_image.is_container():
    import ray
    from flytekitplugins.ray import HeadNodeConfig, RayJobConfig, WorkerNodeConfig


@ray.remote
def f1(x):
    return x * x


@ray.remote
def f2(x):
    return x % 2


ray_config = RayJobConfig(
    head_node_config=HeadNodeConfig(ray_start_params={"log-color": "True"}),
    worker_node_config=[WorkerNodeConfig(group_name="ray-group", replicas=1)],
    runtime_env={"pip": ["numpy", "pandas"]},  # or runtime_env="./requirements.txt"
)


@task(
    cache=True,
    cache_version="0.2",
    task_config=ray_config,
    requests=Resources(mem="2Gi", cpu="1"),
    container_image=custom_image,
)
def ray_task(n: int) -> int:
    futures = [f2.remote(f1.remote(i)) for i in range(n)]
    return sum(ray.get(futures))


@workflow
def ray_workflow(n: int) -> int:
    return ray_task(n=n)
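For checking a successful run against a known answer, the arithmetic that ray_task performs can be reproduced without Ray. This is only a minimal sketch: the helper name expected_ray_task_result is hypothetical and is not part of the code posted above.

```python
def expected_ray_task_result(n: int) -> int:
    """Plain-Python equivalent of ray_task: square each i, take it modulo 2, sum."""
    # f1 squares its input, f2 reduces the square modulo 2,
    # and ray_task sums the results over range(n).
    return sum((i * i) % 2 for i in range(n))


if __name__ == "__main__":
    # With --n 1 (the value used in the pyflyte command) only i = 0 is processed.
    print(expected_ray_task_result(1))
```

If the remote execution ever completes, its output for a given n should match this function's return value.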
project_config.yaml
domain: development
project: flytesnacks
defaults:
  cpu: "1"
  memory: "2Gi"
limits:
  cpu: "3"
  memory: "8Gi"
I also tried with KubeRay versions 0.3 and 0.5.2; it does not work with either.
• flytectl demo start
• Installed kuberay
export KUBERAY_VERSION=v0.5.2
kubectl create -k "github.com/ray-project/kuberay/manifests/cluster-scope-resources?ref=${KUBERAY_VERSION}&timeout=90s"
kubectl apply -k "github.com/ray-project/kuberay/manifests/base?ref=${KUBERAY_VERSION}&timeout=90s"
• flytectl update task-resource-attribute --attrFile project_config.yaml
• pyflyte run --remote example_ray.py ray_workflow --n 1
Samhita Alla
With KubeRay v0.6.0, I'm seeing the following error in the task pod:
RuntimeError: Request failed with status code 500: Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/ray/dashboard/modules/job/job_head.py", line 287, in submit_job
    resp = await job_agent_client.submit_job_internal(submit_request)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/ray/dashboard/modules/job/job_head.py", line 73, in submit_job_internal
    async with self._session.post(
  File "/usr/local/lib/python3.11/site-packages/aiohttp/client.py", line 1141, in __aenter__
    self._resp = await self._coro
                 ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/aiohttp/client.py", line 560, in _request
    await resp.start(conn)
  File "/usr/local/lib/python3.11/site-packages/aiohttp/client_reqrep.py", line 899, in start
    message, payload = await protocol.read()  # type: ignore[union-attr]
                       ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/aiohttp/streams.py", line 616, in read
    await self._waiter
aiohttp.client_exceptions.ServerDisconnectedError: Server disconnected
@Kevin Su With KubeRay v0.5.2 that I installed via helm, I'm seeing the following error:
2023-09-12T13:40:30.840Z ERROR controllers.RayJob failed to submit job {"error": "SubmitJob fail: Traceback (most recent call last):\n File \"/usr/local/lib/python3.10/site-packages/aiohttp/connector.py\", line 980, in _wrap_create_connection\n return await self._loop.create_connection(*args, **kwargs) # type: ignore[return-value] # noqa\n File \"/usr/local/lib/python3.10/asyncio/base_events.py\", line 1076, in create_connection\n raise exceptions[0]\n File \"/usr/local/lib/python3.10/asyncio/base_events.py\", line 1060, in create_connection\n sock = await self._connect_sock(\n File \"/usr/local/lib/python3.10/asyncio/base_events.py\", line 969, in _connect_sock\n await self.sock_connect(sock, address)\n File \"/usr/local/lib/python3.10/asyncio/selector_events.py\", line 501, in sock_connect\n return await fut\n File \"/usr/local/lib/python3.10/asyncio/selector_events.py\", line 541, in _sock_connect_cb\n raise OSError(err, f'Connect call failed {address}')\nConnectionRefusedError: [Errno 111] Connect call failed ('10.42.0.17', 52365)\n\nThe above exception was the direct cause of the following exception:\n\nTraceback (most recent call last):\n File \"/usr/local/lib/python3.10/site-packages/ray/dashboard/modules/job/job_head.py\", line 287, in submit_job\n resp = await job_agent_client.submit_job_internal(submit_request)\n File \"/usr/local/lib/python3.10/site-packages/ray/dashboard/modules/job/job_head.py\", line 73, in submit_job_internal\n async with self._session.post(\n File \"/usr/local/lib/python3.10/site-packages/aiohttp/client.py\", line 1141, in __aenter__\n self._resp = await self._coro\n File \"/usr/local/lib/python3.10/site-packages/aiohttp/client.py\", line 536, in _request\n conn = await self._connector.connect(\n File \"/usr/local/lib/python3.10/site-packages/aiohttp/connector.py\", line 540, in connect\n proto = await self._create_connection(req, traces, timeout)\n File \"/usr/local/lib/python3.10/site-packages/aiohttp/connector.py\", line 901, in _create_connection\n _, proto = await self._create_direct_connection(req, traces, timeout)\n File \"/usr/local/lib/python3.10/site-packages/aiohttp/connector.py\", line 1209, in _create_direct_connection\n raise last_exc\n File \"/usr/local/lib/python3.10/site-packages/aiohttp/connector.py\", line 1178, in _create_direct_connection\n transp, proto = await self._wrap_create_connection(\n File \"/usr/local/lib/python3.10/site-packages/aiohttp/connector.py\", line 988, in _wrap_create_connection\n raise client_error(req.connection_key, exc) from exc\naiohttp.client_exceptions.ClientConnectorError: Cannot connect to host 10.42.0.17:52365 ssl:default [Connect call failed ('10.42.0.17', 52365)]\n"}
ImageSpec config:
custom_image = ImageSpec(
    name="ray-flyte-plugin",
    registry="samhitaalla",
    packages=["flytekitplugins-ray==1.9.1", "ray==2.6.3"],
    base_image="ghcr.io/flyteorg/flytekit:py3.10-1.9.1"
)
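The KubeRay operator log line above carries the whole traceback inside a JSON "error" field with escaped newlines, which makes it hard to read. A rough sketch of unpacking such a line follows; the helper name extract_error and the shortened sample line are made up for illustration, and it assumes the text from the first "{" onward is valid JSON.

```python
import json


def extract_error(log_line: str) -> str:
    """Return the decoded "error" field from a log line ending in a JSON object."""
    start = log_line.index("{")  # assumption: the JSON payload starts at the first brace
    payload = json.loads(log_line[start:])
    return payload["error"]


if __name__ == "__main__":
    # Shortened, made-up sample in the same shape as the operator log line.
    sample = (
        "2023-09-12T13:40:30.840Z ERROR controllers.RayJob failed to submit job "
        '{"error": "SubmitJob fail: Traceback (most recent call last):\\n'
        'ConnectionRefusedError: [Errno 111] Connect call failed\\n"}'
    )
    # Printing the decoded field shows the traceback with real line breaks.
    print(extract_error(sample))
```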
Kevin Su
09/12/2023, 6:28 PM
Samhita Alla
Anirudh Sridhar
09/14/2023, 10:20 AM
Kevin Su
09/14/2023, 3:41 PM
Anirudh Sridhar
09/17/2023, 10:46 AM
Kevin Su
09/17/2023, 10:48 AM
Anirudh Sridhar
09/17/2023, 10:52 AM
Kevin Su
09/17/2023, 10:53 AM
kubectl edit -n ray-system deploy kuberay-operator
to update the image
custom_image = ImageSpec(
    ...
    platform="linux/arm64"
)
Anirudh Sridhar
09/17/2023, 12:06 PM
Failed to create Ray job: fe6dfd3a66f074481942-n0-0
Kevin Su
09/18/2023, 6:51 AM
Anirudh Sridhar
09/18/2023, 6:51 AM
Kevin Su
09/18/2023, 6:58 AM
Anirudh Sridhar
09/18/2023, 7:00 AM
anirudh1905/ray-flyte-plugin
Kevin Su
09/18/2023, 7:01 AM
Anirudh Sridhar
09/18/2023, 7:01 AM
anirudh1905/ray-flyte-plugin:MyZdKEQHTVHkHlvw__ZUxQ..
Kevin Su
09/18/2023, 7:48 AM
custom_image = ImageSpec(
    ...
    packages=["flytekitplugins-ray", "flytekit==1.9.0"],
    platform="linux/arm64"
)
Anirudh Sridhar
09/18/2023, 12:15 PM
domain: development
project: flytesnacks
defaults:
  cpu: "3"
  memory: "2Gi"
limits:
  cpu: "4"
  memory: "8Gi"
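Compared with the project_config.yaml from the first message, only the CPU values differ. A small sketch that makes the difference explicit is below; the two dictionaries are transcribed by hand from the YAML in this thread, and the helper name changed_fields is hypothetical.

```python
# Resource attributes from the first message and from this one, transcribed by hand.
first_config = {
    "defaults": {"cpu": "1", "memory": "2Gi"},
    "limits": {"cpu": "3", "memory": "8Gi"},
}
latest_config = {
    "defaults": {"cpu": "3", "memory": "2Gi"},
    "limits": {"cpu": "4", "memory": "8Gi"},
}


def changed_fields(old: dict, new: dict) -> dict:
    """Return {(section, key): (old_value, new_value)} for every value that differs."""
    return {
        (section, key): (old[section][key], new[section][key])
        for section in old
        for key in old[section]
        if old[section][key] != new[section][key]
    }


if __name__ == "__main__":
    for (section, key), (before, after) in changed_fields(first_config, latest_config).items():
        print(f"{section}.{key}: {before} -> {after}")
```

The memory settings are identical in both versions; the default CPU goes from 1 to 3 and the CPU limit from 3 to 4.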