mammoth-mouse-1111
12/20/2024, 5:05 PM
I'm running Flyte with the flyte-binary deployment, and I've added the necessary plugins for flyte to spin up ray clusters.
However, when I try to launch an example job (in thread), I can see the ray head and worker pods being spun up, but the job seems to hang. Checking the logs of the pods doesn't yield any useful error messages.
I think it could be due to the head node needing more resources? I'm having some difficulty understanding how resources are allocated to nodes:
• The Resources object for the flyte task seems to override any defaults I set in task-resource-attributes. It overrides the defaults for limits with the values from requests, even if I don't specify a limit in Resources. (See the sketch after this list.)
• The Resources object seems to provision the ray head and worker nodes with the same resource requests and limits. However, I can ask for different resources in ray_start_params in the HeadNodeConfig. Is this how I'm supposed to provision the head node with different resources? If the amount of resources I request for the head node exceeds the limits in the Resources object, what happens?
• If I set the request in the Resources object too high, fewer than the expected number of ray workers are created. I assume this is because we're bumping against the memory limit?
• If I set the limit in the Resources object too high, the head node of the ray cluster isn't created at all and the flyte execution hangs. Why is this the case?
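For reference, a minimal sketch of the behavior described in the first bullet; the task and values here are made up, and this just illustrates what I'm observing rather than documented behavior:

from flytekit import Resources, task

# Only requests are set here; no limits.
@task(requests=Resources(mem="500Mi", cpu="2"))
def only_requests_task() -> int:
    # Observed: the container ends up with limits mem=500Mi, cpu=2
    # (copied from requests) instead of the defaults configured in
    # task-resource-attributes.
    return 1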
No matter how I try to provision resources, the job itself doesn't seem to progress, even though it should be extremely fast.
Thank you, and any help would be greatly appreciated!
mammoth-mouse-1111
12/20/2024, 5:06 PM
import typing
from flytekit import ImageSpec, Resources, task, workflow
import ray
from flytekitplugins.ray import HeadNodeConfig, RayJobConfig, WorkerNodeConfig
import time

custom_image = ImageSpec(
    registry="ghcr.io/flyteorg",
    packages=["flytekitplugins-ray"],
    # kuberay operator needs wget for readiness probe.
    apt_packages=["wget"],
)

@ray.remote
def f(x):
    print(x)
    return x * x

ray_config = RayJobConfig(
    head_node_config=HeadNodeConfig(
        ray_start_params={
            "log-color": "True",
            "num-cpus": "4",
            "memory": str(int(5e9)),  # 5GB
            "object-store-memory": str(int(5e8)),  # 500MB
        }
    ),
    worker_node_config=[
        WorkerNodeConfig(
            group_name="parth-vaders-ray-cluster",
            min_replicas=3,
            replicas=3,
            max_replicas=5,
        )
    ],
    runtime_env={"pip": ["numpy", "pandas"]},  # or runtime_env="./requirements.txt"
    enable_autoscaling=False,
    shutdown_after_job_finishes=True,
    ttl_seconds_after_finished=3600,
)

# Set resource request and limit here
# both head node and worker nodes get same resources...
@task(
    task_config=ray_config,
    requests=Resources(mem="500Mi", cpu="2"),
    limits=Resources(mem="32Gi", cpu="32"),
    container_image=custom_image,
)
def ray_task(n: int) -> typing.List[int]:
    futures = [f.remote(i) for i in range(n)]
    return ray.get(futures)

@workflow
def ray_workflow(n: int) -> typing.List[int]:
    return ray_task(n=n)

if __name__ == "__main__":
    print(ray_workflow(n=10))
mammoth-mouse-1111
12/20/2024, 5:09 PM
pyflyte run --remote ray_wf.py ray_workflow --n 10
mammoth-mouse-1111
12/20/2024, 5:13 PM
flyteidl 1.14.0
flytekit 1.14.0
flytekitplugins-ray 1.14.0
flytectl version 0.9.4
average-finland-92144
01/02/2025, 4:50 PM
average-finland-92144
01/02/2025, 4:54 PM
mammoth-mouse-1111
01/06/2025, 6:50 PM
I'm now getting: AttributeError: 'V1PodSpec' object has no attribute 'to_flyte_idl'
my config looks like this:

worker_node_config = WorkerNodeConfig(
    group_name="parth-test-cluster",
    replicas=3,
    min_replicas=3,
    max_replicas=3,
    k8s_pod=V1PodSpec(
        containers=[
            V1Container(
                name="ray-task",
                resources=V1ResourceRequirements(
                    requests={"cpu": "1", "mem": "1Gi"}
                )
            )
        ]
    )
)

head_node_config = HeadNodeConfig(
    ray_start_params={
        "log-color": "True",
        "num-cpus": "4",
        "memory": str(int(5e9)),
        "object-store-memory": str(int(1e9))
    },
    k8s_pod=V1PodSpec(
        containers=[
            V1Container(
                name="ray-head",
                resources=V1ResourceRequirements(
                    requests={"cpu": "4", "mem": "8Gi"}
                )
            )
        ]
    )
)

ray_config = RayJobConfig(
    worker_node_config=[worker_node_config],
    head_node_config=head_node_config,
    runtime_env={"pip": ["numpy", "pandas"]},  # or runtime_env="./requirements.txt"
    enable_autoscaling=True,
    shutdown_after_job_finishes=True,
)

any idea what I'm doing wrong here?
clean-glass-36808
01/06/2025, 10:29 PM
clean-glass-36808
01/06/2025, 10:33 PM
clean-glass-36808
01/06/2025, 10:41 PM
mammoth-mouse-1111
01/07/2025, 3:12 PM
I see you're constructing a K8sPod object via the pod_spec dictionary, but it looks like you are passing in random strings or ints for testing purposes. I don't know how I should specify in this dictionary that I want X CPU, Y memory, etc.
Also, what is to_flyte_idl() supposed to do on the RayJob? Is this necessary for all RayJob objects, or just ones where we specify different resources for head vs worker pods?
mammoth-mouse-1111
01/07/2025, 3:29 PM
Is that the RayJob object or the RayJobConfig object?
clean-glass-36808
01/07/2025, 5:49 PM
clean-glass-36808
01/07/2025, 5:50 PM
mammoth-mouse-1111
01/07/2025, 5:51 PM
clean-glass-36808
01/07/2025, 7:54 PM
custom_pod_spec = {"containers": [{"name": "ray-worker", "resources": {"requests": {"cpu": "3", "memory": "2Gi"}, "limits": {"cpu": "3", "memory": "2Gi"}}}]}
ray_config = RayJobConfig(
    worker_node_config=[WorkerNodeConfig(group_name="ray-group", replicas=1, k8s_pod=K8sPod(pod_spec=custom_pod_spec))],
    ...
)
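To size the head pod the same way, here is a minimal sketch along the same lines; the resource values are placeholders, the K8sPod import path is an assumption, and (per the note further down) the head container should be named ray-head:

from flytekit.models.task import K8sPod  # assumption: K8sPod lives here
from flytekitplugins.ray import HeadNodeConfig, RayJobConfig, WorkerNodeConfig

# Placeholder sizes; the head container name must be "ray-head".
head_pod_spec = {
    "containers": [
        {
            "name": "ray-head",
            "resources": {
                "requests": {"cpu": "4", "memory": "8Gi"},
                "limits": {"cpu": "4", "memory": "8Gi"},
            },
        }
    ]
}

ray_config = RayJobConfig(
    head_node_config=HeadNodeConfig(k8s_pod=K8sPod(pod_spec=head_pod_spec)),
    worker_node_config=[
        WorkerNodeConfig(
            group_name="ray-group",
            replicas=1,
            k8s_pod=K8sPod(pod_spec=custom_pod_spec),  # worker spec from the snippet above
        )
    ],
)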
clean-glass-36808
01/07/2025, 7:54 PM
clean-glass-36808
01/07/2025, 7:55 PM
mammoth-mouse-1111
01/07/2025, 7:57 PM
clean-glass-36808
01/07/2025, 7:57 PM
kubectl describe pod a1549409eb-n0-0-raycluster-s4kqw-worker-ray-group-zv7kz | grep Requests -A 3
Requests:
  cpu:     200m
  memory:  256Mi
Environment:
--
Requests:
  cpu:     3
  memory:  2Gi
Liveness:  exec [bash -c wget -T 2 -q -O- http://localhost:52365/api/local_raylet_healthz | grep success] delay=30s timeout=1s period=5s #success=1 #failure=120
clean-glass-36808
01/07/2025, 7:58 PM
Use ray-head as the container name for the head node pod spec.
mammoth-mouse-1111
01/07/2025, 10:02 PM
Update: the problem turned out to be a leftover resource quota from our earlier flyte-core install that stuck around even after we were done experimenting and moved to flyte-binary. This was preventing the ray cluster from being spun up successfully. Deleting that quota from kubernetes has resolved the issue.
Specifically, it was the cluster_resource_manager in flyte-core.
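In case it helps anyone else, a quick sketch (not from this thread) of how a leftover quota could be spotted and removed with the official kubernetes Python client; the namespace name is a placeholder for your <project>-<domain> namespace:

from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

namespace = "flytesnacks-development"  # placeholder: the <project>-<domain> namespace
for quota in core.list_namespaced_resource_quota(namespace).items:
    # A quota created by flyte-core's cluster_resource_manager would show up here.
    print(quota.metadata.name, quota.spec.hard)

# After confirming a quota is stale, it can be removed with:
# core.delete_namespaced_resource_quota(name="<quota-name>", namespace=namespace)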
mammoth-mouse-1111
01/07/2025, 10:03 PM
average-finland-92144
01/08/2025, 12:35 PM
mammoth-mouse-1111
01/08/2025, 2:48 PM
Switching away from flyte-core will still leave this active. Either fixing that or better docs about projectQuota would be helpful, imo.