# flyte-support
m
Hi all šŸ™‚ I'm a flyte beginner and am trying to deploy flyte on my company's on-prem server with a ray backend. I've followed the instructions in the flyte-the-hard-way repo to deploy `flyte-binary`, and I've added the necessary plugins for flyte to spin up ray clusters. However, when I try to launch an example job (in thread), I can see ray head and worker pods being spun up, but the job seems to hang. Checking the logs of the pods doesn't yield any useful error messages. I think it could be due to the head node needing more resources? I'm having some difficulty understanding how resources are allocated to nodes:
• The `Resources` object for the flyte task seems to override any defaults I set in `task-resource-attributes`. It overrides the defaults for `limits` with the values for `requests`, even if I don't specify a limit in `Resources`.
• The `Resources` object seems to provision ray head and worker nodes with the same requests & limits. However, I can ask for different resources via `ray_start_params` in the `HeadNodeConfig`. Is this how I'm supposed to provision the head node with different resources? If the amount of resources I request for the head node exceeds the limits in the `Resources` object, what happens?
• If I set the request in the `Resources` object too high, fewer than the expected number of ray workers are created. I assume this is because we're bumping against the memory limit?
• If I set the limit in the `Resources` object too high, the head node of the ray cluster isn't created at all and the flyte execution hangs. Why is this the case?
No matter how I try to provision resources, the job itself doesn't seem to progress, even though it should be extremely fast. Thank you, and any help would be greatly appreciated šŸ™
Copy code
import typing
from flytekit import ImageSpec, Resources, task, workflow
import ray
from flytekitplugins.ray import HeadNodeConfig, RayJobConfig, WorkerNodeConfig
import time

custom_image = ImageSpec(
    registry="<http://ghcr.io/flyteorg|ghcr.io/flyteorg>",
    packages=["flytekitplugins-ray"],
    # kuberay operator needs wget for readiness probe.
    apt_packages=["wget"],
)

@ray.remote
def f(x):
    print(x)
    return x * x


ray_config = RayJobConfig(
    head_node_config=HeadNodeConfig(
        ray_start_params={
            "log-color": "True",
            "num-cpus": "4",
            "memory": str(int(5e9)), # 5GB
            "object-store-memory": str(int(5e8)) # 500MB
        }
    ),
    worker_node_config=[
        WorkerNodeConfig(
            group_name="parth-vaders-ray-cluster",
            min_replicas=3,
            replicas=3,
            max_replicas=5
        )
    ],
    runtime_env={"pip": ["numpy", "pandas"]},  # or runtime_env="./requirements.txt"
    enable_autoscaling=False,
    shutdown_after_job_finishes=True,
    ttl_seconds_after_finished=3600,
)

# Set resource request and limit here
# both head node and worker nodes get same resources...
@task(
    task_config=ray_config,
    requests=Resources(mem="500Mi", cpu="2"),
    limits=Resources(mem="32Gi", cpu="32"),
    container_image=custom_image,
)
def ray_task(n: int) -> typing.List[int]:
    futures = [f.remote(i) for i in range(n)]
    return ray.get(futures)

@workflow
def ray_workflow(n: int) -> typing.List[int]:
    return ray_task(n=n)

if __name__ == "__main__":
    print(ray_workflow(n=10))
Running the above script with:
Copy code
pyflyte run --remote ray_wf.py ray_workflow --n 10
package versions:
Copy code
flyteidl                  1.14.0
flytekit                  1.14.0
flytekitplugins-ray       1.14.0
flytectl version 0.9.4
a
@mammoth-mouse-1111 sorry for the delays. Considering you're using the latest flyte versions, you could try passing requirements to the Ray head and worker nodes as part of a pod spec (see the example in the release notes).
It's true that flytekit makes requests == limits, and that the `Resources` config at the task level overrides what you set in the `task_resources` section.
m
Hi David - thanks for getting back to me. I tried passing requirements with a pod spec but hit the following error:
Copy code
AttributeError: 'V1PodSpec' object has no attribute 'to_flyte_idl'
my config looks like this:
Copy code
from kubernetes.client import V1Container, V1PodSpec, V1ResourceRequirements

worker_node_config = WorkerNodeConfig(
    group_name="parth-test-cluster",
    replicas=3,
    min_replicas=3,
    max_replicas=3,
    k8s_pod=V1PodSpec(
        containers=[
            V1Container(
                name="ray-task",
                resources=V1ResourceRequirements(
                    requests={"cpu": "1", "mem": "1Gi"}
                )
            )
        ]
    )
)

head_node_config = HeadNodeConfig(
    ray_start_params={
        "log-color": "True",
        "num-cpus": "4",
        "memory": str(int(5e9)),
        "object-store-memory": str(int(1e9))
    },
    k8s_pod=V1PodSpec(
        containers=[
            V1Container(
                name="ray-head",
                resources=V1ResourceRequirements(
                    requests={"cpu": "4", "mem": "8Gi"}
                )
            )
        ]
    )
)

ray_config = RayJobConfig(
    worker_node_config=[worker_node_config],
    head_node_config=head_node_config,
    runtime_env={"pip": ["numpy", "pandas"]},  # or runtime_env="./requirements.txt"
    enable_autoscaling=True,
    shutdown_after_job_finishes=True,
)
any idea what I'm doing wrong here?
c
I'm not sure the release notes have the correct syntax. See the unit tests here for some working examples: https://github.com/flyteorg/flytekit/blob/master/plugins/flytekit-ray/tests/test_ray.py#L59-L77
I worked with @average-finland-92144 on those release notes, so that one is definitely my fault.
I'll be experimenting with this myself this week so should have a full example soon.
m
Sorry Jason, the unit tests don't seem to clarify how to use this API either. It looks like you pass everything to the flyte.models `K8sPod` object via the `pod_spec` dictionary, but it looks like you are passing in random strings or ints for testing purposes. I don't know how I should specify in this dictionary that I want X CPU, Y memory, etc. Also, what is `to_flyte_idl()` supposed to do on the `RayJob`? Is this necessary for all `RayJob` objects, or just ones where we specify different resources for head vs worker pods?
Also, wait: are you suggesting I use the `RayJob` object or the `RayJobConfig` object?
c
I am close to testing the custom resources, will report back
m
Thank you!
c
ok this works
Copy code
custom_pod_spec = {
    "containers": [
        {
            "name": "ray-worker",
            "resources": {
                "requests": {"cpu": "3", "memory": "2Gi"},
                "limits": {"cpu": "3", "memory": "2Gi"},
            },
        }
    ]
}

ray_config = RayJobConfig(
    worker_node_config=[WorkerNodeConfig(group_name="ray-group", replicas=1, k8s_pod=K8sPod(pod_spec=custom_pod_spec))],
...
)
So you need to specify it in the container and match the container name for worker/head
There might be a more elegant way to create the JSON but yeah
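For example, one option could be to build the dict from the typed kubernetes client models instead of hand-writing the JSON (untested sketch; assumes the `kubernetes` Python package is installed):
Copy code
# Sketch: build the pod spec dict from kubernetes client models instead of raw JSON.
from kubernetes.client import ApiClient, V1Container, V1PodSpec, V1ResourceRequirements

worker_pod_spec = V1PodSpec(
    containers=[
        V1Container(
            name="ray-worker",  # must match the Ray worker container name
            resources=V1ResourceRequirements(
                requests={"cpu": "3", "memory": "2Gi"},
                limits={"cpu": "3", "memory": "2Gi"},
            ),
        )
    ]
)

# sanitize_for_serialization converts the typed model into the camelCase dict
# that K8sPod(pod_spec=...) expects.
custom_pod_spec = ApiClient().sanitize_for_serialization(worker_pod_spec)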
m
thanks - I'll give this a shot and get back to you!
c
Copy code
$ kubectl describe pod a1549409eb-n0-0-raycluster-s4kqw-worker-ray-group-zv7kz | grep Requests -A 3
    Requests:
      cpu:     200m
      memory:  256Mi
    Environment:
--
    Requests:
      cpu:      3
      memory:   2Gi
    Liveness:   exec [bash -c wget -T 2 -q -O- http://localhost:52365/api/local_raylet_healthz | grep success] delay=30s timeout=1s period=5s #success=1 #failure=120
Use `ray-head` as the container name for the head node pod spec
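The unit tests suggest `HeadNodeConfig` takes `k8s_pod` the same way `WorkerNodeConfig` does, so something along these lines should work (untested sketch, continuing from the worker spec above):
Copy code
from flytekit.models.task import K8sPod  # same K8sPod used for the worker example
from flytekitplugins.ray import HeadNodeConfig, RayJobConfig, WorkerNodeConfig

# Sketch: give the head pod its own requests/limits via K8sPod.
head_pod_spec = {
    "containers": [
        {
            "name": "ray-head",  # container name must be ray-head for the head pod
            "resources": {
                "requests": {"cpu": "4", "memory": "8Gi"},
                "limits": {"cpu": "4", "memory": "8Gi"},
            },
        }
    ]
}

ray_config = RayJobConfig(
    head_node_config=HeadNodeConfig(k8s_pod=K8sPod(pod_spec=head_pod_spec)),
    worker_node_config=[
        WorkerNodeConfig(
            group_name="ray-group",
            replicas=1,
            k8s_pod=K8sPod(pod_spec=custom_pod_spec),  # worker spec from above
        )
    ],
)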
m
Thanks Jason! After investigating with my coworker, it turns out that there was actually a project quota set by `flyte-core` that stuck around even after we were done experimenting and moved to `flyte-binary`. This was preventing the ray cluster from being spun up successfully. Deleting that quota from kubernetes has resolved the issue. Specifically, it was the `cluster_resource_manager` in `flyte-core`.
I tried the configs you suggested above and "succeeded" (hit a different error) so I'm pretty sure what you suggested works. Thanks for all the help!
a
Thanks so much @clean-glass-36808! I'll update the release notes example. @mammoth-mouse-1111 do you think we should get rid of the projectQuota that comes with flyte-core? I think it's too opinionated and could cause these kinds of problems
m
I haven't been using flyte long enough to have an opinion about projectQuota lol. What did trip me up, though, is that `helm uninstall`ing `flyte-core` will still leave this active. Either fixing that or better docs about `projectQuota` would be helpful imo