mammoth-mouse-1111
12/20/2024, 5:05 PM
I'm running a Flyte deployment with flyte-binary, and I've added the necessary plugins for Flyte to spin up Ray clusters.
However, when I try to launch an example job (in thread), I can see the Ray head and worker pods being spun up, but the job seems to hang. Checking the pod logs doesn't yield any useful error messages.
I think it could be due to the head node needing more resources? I'm having some difficulty understanding how resources are allocated to nodes:
• The Resources object for the Flyte task seems to override any defaults I set in task-resource-attributes. It also overrides the default limits with the request values, even if I don't specify a limit in Resources (a minimal sketch of this is below).
• The Resources object seems to provision the Ray head and worker nodes with the same requests and limits. However, I can ask for different resources in ray_start_params in the HeadNodeConfig. Is this how I'm supposed to provision the head node with different resources than the workers? And if the resources I request for the head node exceed the limits in the Resources object, what happens?
• If I set the request in the Resources object too high, fewer than the expected number of Ray workers are created. I assume this is because we're bumping against the memory limit?
• If I set the limit in the Resources object too high, the head node of the Ray cluster isn't created at all and the Flyte execution hangs. Why is this the case?
No matter how I try to provision resources, the job itself doesn't seem to make progress, even though it should be extremely fast.
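For reference, a minimal sketch of the limits behavior described in the first bullet (task names and resource values here are placeholders for illustration only): with only requests set, the pods end up with limits equal to the requests rather than the task-resource-attributes defaults; setting limits explicitly avoids that.
from flytekit import Resources, task
# Only requests set: the resulting pod gets limits copied from these requests,
# not from the task-resource-attributes defaults.
@task(requests=Resources(cpu="2", mem="4Gi"))
def only_requests() -> int:
    return 1
# Requests and limits both set explicitly on the task.
@task(requests=Resources(cpu="2", mem="4Gi"), limits=Resources(cpu="4", mem="8Gi"))
def explicit_limits() -> int:
    return 1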
Thank you, and any help would be greatly appreciated.
mammoth-mouse-1111
12/20/2024, 5:06 PM
import typing
from flytekit import ImageSpec, Resources, task, workflow
import ray
from flytekitplugins.ray import HeadNodeConfig, RayJobConfig, WorkerNodeConfig
import time
custom_image = ImageSpec(
    registry="ghcr.io/flyteorg",
    packages=["flytekitplugins-ray"],
    # kuberay operator needs wget for readiness probe.
    apt_packages=["wget"],
)
@ray.remote
def f(x):
    print(x)
    return x * x
ray_config = RayJobConfig(
    head_node_config=HeadNodeConfig(
        ray_start_params={
            "log-color": "True",
            "num-cpus": "4",
            "memory": str(int(5e9)),  # 5GB
            "object-store-memory": str(int(5e8)),  # 500MB
        }
    ),
    worker_node_config=[
        WorkerNodeConfig(
            group_name="parth-vaders-ray-cluster",
            min_replicas=3,
            replicas=3,
            max_replicas=5,
        )
    ],
    runtime_env={"pip": ["numpy", "pandas"]},  # or runtime_env="./requirements.txt"
    enable_autoscaling=False,
    shutdown_after_job_finishes=True,
    ttl_seconds_after_finished=3600,
)
# Set resource request and limit here
# both head node and worker nodes get same resources...
@task(
    task_config=ray_config,
    requests=Resources(mem="500Mi", cpu="2"),
    limits=Resources(mem="32Gi", cpu="32"),
    container_image=custom_image,
)
def ray_task(n: int) -> typing.List[int]:
    futures = [f.remote(i) for i in range(n)]
    return ray.get(futures)
@workflow
def ray_workflow(n: int) -> typing.List[int]:
    return ray_task(n=n)
if __name__ == "__main__":
    print(ray_workflow(n=10))
mammoth-mouse-1111
12/20/2024, 5:09 PM
pyflyte run --remote ray_wf.py ray_workflow --n 10
mammoth-mouse-1111
12/20/2024, 5:13 PM
flyteidl 1.14.0
flytekit 1.14.0
flytekitplugins-ray 1.14.0
flytectl version 0.9.4
average-finland-92144
01/02/2025, 4:50 PM
average-finland-92144
01/02/2025, 4:54 PM
mammoth-mouse-1111
01/06/2025, 6:50 PM
AttributeError: 'V1PodSpec' object has no attribute 'to_flyte_idl'
My config looks like this:
worker_node_config = WorkerNodeConfig(
    group_name="parth-test-cluster",
    replicas=3,
    min_replicas=3,
    max_replicas=3,
    k8s_pod=V1PodSpec(
        containers=[
            V1Container(
                name="ray-task",
                resources=V1ResourceRequirements(
                    requests={"cpu": "1", "mem": "1Gi"}
                )
            )
        ]
    )
)
head_node_config = HeadNodeConfig(
    ray_start_params={
        "log-color": "True",
        "num-cpus": "4",
        "memory": str(int(5e9)),
        "object-store-memory": str(int(1e9))
    },
    k8s_pod=V1PodSpec(
        containers=[
            V1Container(
                name="ray-head",
                resources=V1ResourceRequirements(
                    requests={"cpu": "4", "mem": "8Gi"}
                )
            )
        ]
    )
)
ray_config = RayJobConfig(
    worker_node_config=[worker_node_config],
    head_node_config=head_node_config,
    runtime_env={"pip": ["numpy", "pandas"]},  # or runtime_env="./requirements.txt"
    enable_autoscaling=True,
    shutdown_after_job_finishes=True,
)
Any idea what I'm doing wrong here?
clean-glass-36808
01/06/2025, 10:29 PM
clean-glass-36808
01/06/2025, 10:33 PM
clean-glass-36808
01/06/2025, 10:41 PM
mammoth-mouse-1111
01/07/2025, 3:12 PM
I see you're constructing a K8sPod object via the pod_spec dictionary, but it looks like you are passing in random strings or ints for testing purposes. I don't know how I should specify in this dictionary that I want X CPU, Y memory, etc.
Also, what is to_flyte_idl() supposed to do on the RayJob? Is it necessary for all RayJob objects, or just ones where we specify different resources for head vs worker pods?
mammoth-mouse-1111
01/07/2025, 3:29 PM
Is it the RayJob object or the RayJobConfig object?
clean-glass-36808
01/07/2025, 5:49 PM
clean-glass-36808
01/07/2025, 5:50 PM
mammoth-mouse-1111
01/07/2025, 5:51 PM
clean-glass-36808
01/07/2025, 7:54 PM
custom_pod_spec = {"containers": [{"name": "ray-worker", "resources": {"requests": {"cpu": "3", "memory": "2Gi"}, "limits": {"cpu": "3", "memory": "2Gi"}}}]}
ray_config = RayJobConfig(
    worker_node_config=[WorkerNodeConfig(group_name="ray-group", replicas=1, k8s_pod=K8sPod(pod_spec=custom_pod_spec))],
    ...
)
clean-glass-36808
01/07/2025, 7:54 PM
clean-glass-36808
01/07/2025, 7:55 PM
mammoth-mouse-1111
01/07/2025, 7:57 PM
clean-glass-36808
01/07/2025, 7:57 PM
kubectl describe pod a1549409eb-n0-0-raycluster-s4kqw-worker-ray-group-zv7kz | grep Requests -A 3
Requests:
  cpu:     200m
  memory:  256Mi
Environment:
--
Requests:
  cpu:     3
  memory:  2Gi
Liveness: exec [bash -c wget -T 2 -q -O- http://localhost:52365/api/local_raylet_healthz | grep success] delay=30s timeout=1s period=5s #success=1 #failure=120
clean-glass-36808
01/07/2025, 7:58 PM
Use ray-head as the container name for the head node pod spec.
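Putting the pieces from this thread together, a minimal sketch of a config with different head and worker resources (assuming K8sPod is imported from flytekit.models.task; the group name and resource values are illustrative, not from the thread):
from flytekit.models.task import K8sPod
from flytekitplugins.ray import HeadNodeConfig, RayJobConfig, WorkerNodeConfig
# Container names follow the convention above: "ray-head" for the head pod,
# "ray-worker" for the worker pods. Resource values are placeholders.
head_pod_spec = {
    "containers": [
        {
            "name": "ray-head",
            "resources": {
                "requests": {"cpu": "4", "memory": "8Gi"},
                "limits": {"cpu": "4", "memory": "8Gi"},
            },
        }
    ]
}
worker_pod_spec = {
    "containers": [
        {
            "name": "ray-worker",
            "resources": {
                "requests": {"cpu": "3", "memory": "2Gi"},
                "limits": {"cpu": "3", "memory": "2Gi"},
            },
        }
    ]
}
ray_config = RayJobConfig(
    head_node_config=HeadNodeConfig(k8s_pod=K8sPod(pod_spec=head_pod_spec)),
    worker_node_config=[
        WorkerNodeConfig(
            group_name="ray-group",
            replicas=3,
            k8s_pod=K8sPod(pod_spec=worker_pod_spec),
        )
    ],
    shutdown_after_job_finishes=True,
)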
mammoth-mouse-1111
01/07/2025, 10:02 PM
It turned out to be a leftover resource quota from our earlier flyte-core install that stuck around even after we were done experimenting and moved to flyte-binary. This was preventing the Ray cluster from being spun up successfully. Deleting that quota from Kubernetes resolved the issue.
Specifically, it was the cluster_resource_manager in flyte-core.
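As a minimal sketch of how to spot such a leftover quota, assuming the kubernetes Python client and that the quota lives in the project-domain namespace (the namespace and quota name below are hypothetical placeholders):
from kubernetes import client, config
# Hypothetical namespace: Flyte typically creates <project>-<domain> namespaces,
# e.g. "flytesnacks-development"; adjust to the project/domain in use.
NAMESPACE = "flytesnacks-development"
config.load_kube_config()
core = client.CoreV1Api()
# List any ResourceQuota objects present in the namespace.
for quota in core.list_namespaced_resource_quota(NAMESPACE).items:
    print(quota.metadata.name, quota.status.hard)
# Once the stale quota is confirmed, it can be deleted (quota name is hypothetical):
# core.delete_namespaced_resource_quota(name="project-quota", namespace=NAMESPACE)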
mammoth-mouse-1111
01/07/2025, 10:03 PM
average-finland-92144
01/08/2025, 12:35 PM
mammoth-mouse-1111
01/08/2025, 2:48 PM
Note that uninstalling flyte-core will still leave this quota active. Either fixing that or better docs about projectQuota would be helpful, imo.