mammoth-mouse-1111
12/20/2024, 5:05 PM
I'm running Flyte with the flyte-binary deployment, and I've added the necessary plugins for flyte to spin up ray clusters.
However, when I try to launch an example job (in thread), I can see the ray head and worker pods being spun up, but the job seems to hang. Checking the logs of the pods doesn't yield any useful error messages.
I think it could be due to the head node needing more resources? I'm having some difficulty understanding how resources are allocated to nodes:
• The Resources object for the flyte task seems to override any defaults I set in task-resource-attributes. It overrides the defaults for limits with the values from requests, even if I don't specify a limit in Resources. (See the sketch after this list.)
• The Resources object seems to provision the ray head and worker nodes with the same resource requests and limits. However, I can ask for different resources in ray_start_params in the HeadNodeConfig. Is this how I'm supposed to provision the head node with different resources? If the amount of resources I request for the head node exceeds the limits in the Resources object, what happens?
• If I set the request in the Resources object too high, fewer than the expected number of ray workers are created. I assume this is because we're bumping against the memory limit?
• If I set the limit in the Resources object too high, the head node of the ray cluster isn't created at all and the flyte execution hangs. Why is this the case?
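For reference, a minimal sketch of the behavior described in the first bullet; the task and values here are made up, and this just illustrates what I'm observing rather than documented behavior:

from flytekit import Resources, task

# Only requests are set here; no limits.
@task(requests=Resources(mem="500Mi", cpu="2"))
def only_requests_task() -> int:
    # Observed: the container ends up with limits mem=500Mi, cpu=2
    # (copied from requests) instead of the defaults configured in
    # task-resource-attributes.
    return 1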
No matter how I try to provision resources, the job itself doesn't seem to progress, even though it should be extremely fast.
Thank you, and any help would be greatly appreciated!
mammoth-mouse-1111
12/20/2024, 5:06 PM
import typing
from flytekit import ImageSpec, Resources, task, workflow
import ray
from flytekitplugins.ray import HeadNodeConfig, RayJobConfig, WorkerNodeConfig
import time

custom_image = ImageSpec(
    registry="ghcr.io/flyteorg",
    packages=["flytekitplugins-ray"],
    # kuberay operator needs wget for readiness probe.
    apt_packages=["wget"],
)

@ray.remote
def f(x):
    print(x)
    return x * x

ray_config = RayJobConfig(
    head_node_config=HeadNodeConfig(
        ray_start_params={
            "log-color": "True",
            "num-cpus": "4",
            "memory": str(int(5e9)),  # 5GB
            "object-store-memory": str(int(5e8)),  # 500MB
        }
    ),
    worker_node_config=[
        WorkerNodeConfig(
            group_name="parth-vaders-ray-cluster",
            min_replicas=3,
            replicas=3,
            max_replicas=5,
        )
    ],
    runtime_env={"pip": ["numpy", "pandas"]},  # or runtime_env="./requirements.txt"
    enable_autoscaling=False,
    shutdown_after_job_finishes=True,
    ttl_seconds_after_finished=3600,
)

# Set resource request and limit here
# both head node and worker nodes get same resources...
@task(
    task_config=ray_config,
    requests=Resources(mem="500Mi", cpu="2"),
    limits=Resources(mem="32Gi", cpu="32"),
    container_image=custom_image,
)
def ray_task(n: int) -> typing.List[int]:
    futures = [f.remote(i) for i in range(n)]
    return ray.get(futures)

@workflow
def ray_workflow(n: int) -> typing.List[int]:
    return ray_task(n=n)

if __name__ == "__main__":
    print(ray_workflow(n=10))
mammoth-mouse-1111
12/20/2024, 5:09 PM
pyflyte run --remote ray_wf.py ray_workflow --n 10
mammoth-mouse-1111
12/20/2024, 5:13 PM
flyteidl 1.14.0
flytekit 1.14.0
flytekitplugins-ray 1.14.0
flytectl version 0.9.4
average-finland-92144
01/02/2025, 4:50 PM
average-finland-92144
01/02/2025, 4:54 PM
mammoth-mouse-1111
01/06/2025, 6:50 PM
I'm now getting: AttributeError: 'V1PodSpec' object has no attribute 'to_flyte_idl'
my config looks like this:

worker_node_config = WorkerNodeConfig(
    group_name="parth-test-cluster",
    replicas=3,
    min_replicas=3,
    max_replicas=3,
    k8s_pod=V1PodSpec(
        containers=[
            V1Container(
                name="ray-task",
                resources=V1ResourceRequirements(
                    requests={"cpu": "1", "mem": "1Gi"}
                )
            )
        ]
    )
)

head_node_config = HeadNodeConfig(
    ray_start_params={
        "log-color": "True",
        "num-cpus": "4",
        "memory": str(int(5e9)),
        "object-store-memory": str(int(1e9))
    },
    k8s_pod=V1PodSpec(
        containers=[
            V1Container(
                name="ray-head",
                resources=V1ResourceRequirements(
                    requests={"cpu": "4", "mem": "8Gi"}
                )
            )
        ]
    )
)

ray_config = RayJobConfig(
    worker_node_config=[worker_node_config],
    head_node_config=head_node_config,
    runtime_env={"pip": ["numpy", "pandas"]},  # or runtime_env="./requirements.txt"
    enable_autoscaling=True,
    shutdown_after_job_finishes=True,
)

any idea what I'm doing wrong here?
clean-glass-36808
01/06/2025, 10:29 PM
clean-glass-36808
01/06/2025, 10:33 PM
clean-glass-36808
01/06/2025, 10:41 PM
mammoth-mouse-1111
01/07/2025, 3:12 PM
I see you're constructing a K8sPod object via the pod_spec dictionary, but it looks like you are passing in random strings or ints for testing purposes. I don't know how I should specify in this dictionary that I want X CPU, Y memory, etc.
Also, what is to_flyte_idl() supposed to do on the RayJob? Is this necessary for all RayJob objects, or just ones where we specify different resources for head vs worker pods?
mammoth-mouse-1111
01/07/2025, 3:29 PM
Is that the RayJob object or the RayJobConfig object?
clean-glass-36808
01/07/2025, 5:49 PM
clean-glass-36808
01/07/2025, 5:50 PM
mammoth-mouse-1111
01/07/2025, 5:51 PM
clean-glass-36808
01/07/2025, 7:54 PM
custom_pod_spec = {"containers": [{"name": "ray-worker", "resources": {"requests": {"cpu": "3", "memory": "2Gi"}, "limits": {"cpu": "3", "memory": "2Gi"}}}]}
ray_config = RayJobConfig(
    worker_node_config=[WorkerNodeConfig(group_name="ray-group", replicas=1, k8s_pod=K8sPod(pod_spec=custom_pod_spec))],
    ...
)
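To size the head pod the same way, here is a minimal sketch along the same lines; the resource values are placeholders, the K8sPod import path is an assumption, and (per the note further down) the head container should be named ray-head:

from flytekit.models.task import K8sPod  # assumption: K8sPod lives here
from flytekitplugins.ray import HeadNodeConfig, RayJobConfig, WorkerNodeConfig

# Placeholder sizes; the head container name must be "ray-head".
head_pod_spec = {
    "containers": [
        {
            "name": "ray-head",
            "resources": {
                "requests": {"cpu": "4", "memory": "8Gi"},
                "limits": {"cpu": "4", "memory": "8Gi"},
            },
        }
    ]
}

ray_config = RayJobConfig(
    head_node_config=HeadNodeConfig(k8s_pod=K8sPod(pod_spec=head_pod_spec)),
    worker_node_config=[
        WorkerNodeConfig(
            group_name="ray-group",
            replicas=1,
            k8s_pod=K8sPod(pod_spec=custom_pod_spec),  # worker spec from the snippet above
        )
    ],
)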
clean-glass-36808
01/07/2025, 7:54 PM
clean-glass-36808
01/07/2025, 7:55 PM
mammoth-mouse-1111
01/07/2025, 7:57 PM
clean-glass-36808
01/07/2025, 7:57 PM
kubectl describe pod a1549409eb-n0-0-raycluster-s4kqw-worker-ray-group-zv7kz | grep Requests -A 3
Requests:
  cpu:     200m
  memory:  256Mi
Environment:
--
Requests:
  cpu:     3
  memory:  2Gi
Liveness:  exec [bash -c wget -T 2 -q -O- http://localhost:52365/api/local_raylet_healthz | grep success] delay=30s timeout=1s period=5s #success=1 #failure=120
clean-glass-36808
01/07/2025, 7:58 PM
Use ray-head as the container name for the head node pod spec.
mammoth-mouse-1111
01/07/2025, 10:02 PM
Update: the problem turned out to be a leftover resource quota from our earlier flyte-core install that stuck around even after we were done experimenting and moved to flyte-binary. This was preventing the ray cluster from being spun up successfully. Deleting that quota from kubernetes has resolved the issue.
Specifically, it was the cluster_resource_manager in flyte-core.
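In case it helps anyone else, a quick sketch (not from this thread) of how a leftover quota could be spotted and removed with the official kubernetes Python client; the namespace name is a placeholder for your <project>-<domain> namespace:

from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

namespace = "flytesnacks-development"  # placeholder: the <project>-<domain> namespace
for quota in core.list_namespaced_resource_quota(namespace).items:
    # A quota created by flyte-core's cluster_resource_manager would show up here.
    print(quota.metadata.name, quota.spec.hard)

# After confirming a quota is stale, it can be removed with:
# core.delete_namespaced_resource_quota(name="<quota-name>", namespace=namespace)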
mammoth-mouse-1111
01/07/2025, 10:03 PM
average-finland-92144
01/08/2025, 12:35 PM
mammoth-mouse-1111
01/08/2025, 2:48 PM
Switching away from flyte-core will still leave this active. Either fixing that or better docs about projectQuota would be helpful, imo.