Flyte #ray-integration: distributed training using Ray in Flyte

future-notebook-79388

11/29/2022, 2:57 PM

Hi, I was trying distributed training using Ray in Flyte. I am getting this error while running.

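from sklearn.datasets import load_breast_cancer
from flytekit import Resources, task, workflow
from flytekitplugins.ray import HeadNodeConfig, RayJobConfig, WorkerNodeConfig
import ray
from ray import tune
from xgboost_ray import RayDMatrix, RayParams, train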

#ray.init()
#ray.init("auto", ignore_reinit_error=True)

ray_config = RayJobConfig(
    head_node_config=HeadNodeConfig(ray_start_params={"log-color": "True"}),
    worker_node_config=[WorkerNodeConfig(group_name="ray-group", replicas=2)],
)

num_actors = 4
num_cpus_per_actor = 1

ray_params = RayParams(
    num_actors=num_actors, cpus_per_actor=num_cpus_per_actor)


def train_model(config):
    train_x, train_y = load_breast_cancer(return_X_y=True)
    train_set = RayDMatrix(train_x, train_y)

    evals_result = {}
    bst = train(
        params=config,
        dtrain=train_set,
        evals_result=evals_result,
        evals=[(train_set, "train")],
        verbose_eval=False,
        ray_params=ray_params)
    bst.save_model("model.xgb")



@task(task_config=ray_config, limits=Resources(mem="2000Mi", cpu="1"))
def train_model_task() -> dict:
    config = {
        "tree_method": "approx",
        "objective": "binary:logistic",
        "eval_metric": ["logloss", "error"],
        "eta": tune.loguniform(1e-4, 1e-1),
        "subsample": tune.uniform(0.5, 1.0),
        "max_depth": tune.randint(1, 9)
    }


    analysis = tune.run(
        train_model,
        config=config,
        metric="train-error",
        mode="min",
        num_samples=4,
        resources_per_trial=ray_params.get_tune_resources())
    return analysis.best_config

@workflow
def train_model_wf() -> dict:
    return train_model_task()

freezing-airport-6809

11/29/2022, 4:09 PM

Running out of disk

freezing-airport-6809

11/29/2022, 4:09 PM

Request more, please.

tall-lock-23197

11/30/2022, 4:21 AM

@task(task_config=ray_config, limits=Resources(mem="2000Mi", cpu="1", ephemeral_storage="500Mi"))

future-notebook-79388

12/01/2022, 10:27 AM

from sklearn.datasets import load_breast_cancer
from flytekit import Resources, task, workflow
from flytekitplugins.ray import HeadNodeConfig, RayJobConfig, WorkerNodeConfig
import ray
from ray import tune
from xgboost_ray import RayDMatrix, RayParams, train

#ray.shutdown()
#ray.init()
#ray.init("auto", ignore_reinit_error=True)

ray_config = RayJobConfig(
    head_node_config=HeadNodeConfig(ray_start_params={"log-color": "True"}),
    worker_node_config=[WorkerNodeConfig(group_name="ray-group", replicas=3)],
)

num_actors = 2
num_cpus_per_actor = 1

ray_params = RayParams(
    num_actors=num_actors, cpus_per_actor=num_cpus_per_actor)


def train_model(config):
    train_x, train_y = load_breast_cancer(return_X_y=True)
    train_set = RayDMatrix(train_x, train_y)

    evals_result = {}
    bst = train(
        params=config,
        dtrain=train_set,
        evals_result=evals_result,
        evals=[(train_set, "train")],
        verbose_eval=False,
        ray_params=ray_params)
    bst.save_model("model.xgb")



#@task(limits=Resources(mem="2000Mi", cpu="1"))
@task(task_config=ray_config, limits=Resources(mem="3000Mi", cpu="1", ephemeral_storage="3000Mi"))
def train_model_task() -> dict:
    config = {
        "tree_method": "approx",
        "objective": "binary:logistic",
        "eval_metric": ["logloss", "error"],
        "eta": tune.loguniform(1e-4, 1e-1),
        "subsample": tune.uniform(0.5, 1.0),
        "max_depth": tune.randint(1, 9)
    }

    analysis = tune.run(
        train_model,
        config=config,
        metric="train-error",
        mode="min",
        num_samples=4,
        max_concurrent_trials=1,
        resources_per_trial=ray_params.get_tune_resources())
    return analysis.best_config

@workflow
def train_model_wf() -> dict:
    return train_model_task()

Still getting this error even when we specify the ephemeral_storage value. Do you have a suggested limit for CPU and memory?

tall-lock-23197

12/01/2022, 1:28 PM

If you’re using the demo cluster, I think 1Gi is the limit.

future-notebook-79388

12/01/2022, 1:38 PM

I am trying it on an EKS cluster.

tall-lock-23197

12/01/2022, 1:42 PM

https://github.com/flyteorg/flyte/blob/aae01aa33eadfb86f1c952eb415f21326ea5519b/charts/flyte-core/values-eks.yaml#L216 section specifies the task resource defaults.

tall-lock-23197

12/01/2022, 1:43 PM

Can you check yours? Please increase the memory. I believe

kubectl -n flyte edit cm flyte-admin-base-config

is the command but I’m not very sure. Let me know if this doesn’t work.

future-notebook-79388

12/01/2022, 1:47 PM

message has been deleted

tall-lock-23197

12/01/2022, 1:48 PM

Nice. Please increase your mem and try again.

future-notebook-79388

12/02/2022, 4:46 AM

I increased the memory in the task. The execution gets queued, but it stays in the pending state for a long time. Even with a remote run, the workflow has been running for more than 4 hours for 4 trials, but the execution is not happening.

tall-lock-23197

12/02/2022, 5:03 AM

Have you seen the message saying you asked for 3 cpu and 0 gpu but the cluster has 2 cpu and 0 gpu?

future-notebook-79388

12/02/2022, 5:05 AM

Yes, but I have requested only 1 CPU. Should I change it anywhere else?

@task(task_config=ray_config, limits=Resources(mem="5000Mi", cpu="1", ephemeral_storage="3000Mi"))

tall-lock-23197

12/02/2022, 5:10 AM

I think it’s because of get_tune_resources().

tall-lock-23197

12/02/2022, 5:10 AM

Have you seen https://docs.ray.io/en/releases-1.11.0/ray-more-libs/xgboost-ray.html#memory-usage section in the doc?

tall-lock-23197

12/02/2022, 5:11 AM

I’m assuming you’re training an xgboost model.

future-notebook-79388

12/05/2022, 10:31 AM

ray_config = RayJobConfig(
    head_node_config=HeadNodeConfig(ray_start_params={"log-color": "True"}),
    worker_node_config=[WorkerNodeConfig(group_name="ray-group", replicas=3)],
)

Is there any way to specify the number of CPUs in the Ray cluster config? Like this?

ray_config = RayJobConfig(
    head_node_config=HeadNodeConfig(ray_start_params={"log-color": "True"}),
    worker_node_config=[WorkerNodeConfig(group_name="ray-group", replicas=3)],
    num_cpus=4,
)

future-notebook-79388

12/05/2022, 10:34 AM

Because, as mentioned above, we have 64 CPUs in the EKS cluster, but it shows this warning that we have only 2 CPUs in the Ray cluster. How do we increase the CPU limit in the Ray cluster config?

tall-lock-23197

12/05/2022, 12:20 PM

I believe you can set them in RayParams.

tall-lock-23197

12/05/2022, 12:20 PM

https://github.com/ray-project/xgboost_ray/blob/ecca2c63385841a0a1938f5edc349893e5ac63fc/xgboost_ray/main.py

future-notebook-79388

12/05/2022, 1:31 PM

Yeah, but in RayParams we can only specify the number of CPUs to be used for each trial (cpus_per_actor). Is there any config to change to increase the CPUs of the Ray cluster as a whole? Even when I increased cpus_per_actor, the requested CPU count is still 2, and it shows the warning that the cluster has only 2 CPUs.

tall-lock-23197

12/05/2022, 1:46 PM

@glamorous-carpet-83516, any idea how we can set the Ray cluster resources? As per the docs, it should be possible with init(), but in this case, since Flyte initializes the cluster, how can a user modify those values?

glamorous-carpet-83516

12/05/2022, 6:50 PM

To set the Ray cluster resources, just update the limits and requests in the @task, like https://github.com/flyteorg/flytesnacks/blob/a3b97943563cfc952b5683525763578685a93[…]694/cookbook/integrations/kubernetes/ray_example/ray_example.py

future-notebook-79388

12/06/2022, 4:31 AM

@task(task_config=ray_config, requests=Resources(mem="5000Mi", cpu="5", ephemeral_storage="1000Mi"), limits=Resources(mem="7000Mi", cpu="9", ephemeral_storage="2000Mi"))

I have requested 5 CPUs, but when it executes it shows only 2 requested CPUs.

future-notebook-79388

12/06/2022, 4:37 AM

It also shows the same warning that we have only 2 CPUs in the cluster.

tall-lock-23197

12/06/2022, 4:38 AM

I’m wondering where it’s picking “you asked for 9.0 cpu” from. Is it from your limits?

future-notebook-79388

12/06/2022, 4:48 AM

I think it is based on the resources requested per trial. When I set cpus_per_actor to 2 and num_actors to 4, it showed 9 requested CPUs. When I decreased them to cpus_per_actor 2 and num_actors 1, it showed 3.

future-notebook-79388

12/06/2022, 4:53 AM

When cpus_per_actor and num_actors are both 1, the actual requested CPU count is 2 and the execution runs fine, since the cluster has the 2 CPUs needed. When num_actors is increased, it requests more CPUs, so the execution does not happen.
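
The CPU counts reported in this thread (2 actors at 1 CPU gave 3, 4 actors at 2 CPUs gave 9, 1 actor at 2 CPUs gave 3, 1 actor at 1 CPU gave 2) are all consistent with num_actors * cpus_per_actor + 1. The sketch below encodes that observation as a quick pre-flight check. The function names are hypothetical, and the extra 1 CPU is inferred from these numbers only, not taken from the xgboost_ray documentation, so it may differ between versions.

```python
def estimate_trial_cpus(num_actors: int, cpus_per_actor: int, overhead_cpus: int = 1) -> int:
    """Estimate the CPUs one Tune trial asks for.

    Assumption (inferred from the numbers in this thread): each trial needs
    the CPUs of all its actors plus `overhead_cpus` for the trial itself.
    """
    return num_actors * cpus_per_actor + overhead_cpus


def trial_fits(num_actors: int, cpus_per_actor: int, cluster_cpus: float) -> bool:
    """Return True if a single trial could be scheduled on `cluster_cpus` CPUs."""
    return estimate_trial_cpus(num_actors, cpus_per_actor) <= cluster_cpus


if __name__ == "__main__":
    cluster_cpus = 2  # the CPU count named in the warning in this thread
    # (num_actors, cpus_per_actor) pairs tried in this thread
    for actors, cpus in [(2, 1), (4, 2), (1, 2), (1, 1)]:
        needed = estimate_trial_cpus(actors, cpus)
        print(f"actors={actors} cpus_per_actor={cpus} needs={needed} "
              f"fits={trial_fits(actors, cpus, cluster_cpus)}")
```

Inside a running Ray task, ray.cluster_resources().get("CPU", 0) reports the CPUs that Ray itself sees and could be passed as cluster_cpus.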

tall-lock-23197

12/06/2022, 4:54 AM

Um, got it. We need to find a way to increase the cluster resources. Not sure why requests isn’t assigning the requested resources to the cluster.

👀 1

future-notebook-79388

12/06/2022, 1:59 PM

Yeah. Kindly let me know if there is any way to do so.

tall-lock-23197

12/08/2022, 5:09 AM

@glamorous-carpet-83516, do you have any ideas?

glamorous-carpet-83516

12/10/2022, 3:00 AM

@future-notebook-79388 Could you describe the RayJob (kubectl describe) and check whether the resources are the same as what you specified in the @task? I guess the head node doesn’t use all the CPU in the pod. In other words, the CPU of the head pod could be 10, but the CPU of the head node process in the pod could be 2.
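
One way to test the guess that the head node process advertises fewer CPUs than its pod has is to set the CPU count explicitly. ray start accepts a --num-cpus option, and the snippets in this thread already pass ray start options through ray_start_params. The helper below is a hypothetical sketch that merges a num-cpus entry into such a dict. It assumes the plugin forwards the entries to ray start unchanged, which this thread does not confirm.

```python
from typing import Dict, Optional


def with_num_cpus(num_cpus: int, base: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Return a copy of `base` with a "num-cpus" entry added.

    Hypothetical helper: values are strings because the ray_start_params
    used in this thread are string-to-string mappings.
    """
    if num_cpus < 1:
        raise ValueError("num_cpus must be at least 1")
    params = dict(base or {})  # copy so the caller's dict is not mutated
    params["num-cpus"] = str(num_cpus)
    return params


# Start from the params used earlier in this thread and add a CPU count.
head_params = with_num_cpus(4, {"log-color": "True"})
print(head_params)
```

The result could then be tried as HeadNodeConfig(ray_start_params=head_params). The value should not exceed the CPU given to the pod through the task's requests and limits, otherwise Ray would schedule more work than the pod can run.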

future-notebook-79388

12/12/2022, 5:12 AM

I have attached the allocated memory when we describe the node.

@task(task_config=ray_config, requests=Resources(mem="5000Mi", cpu="5") , limits=Resources(mem="7000Mi", cpu="9"))

These are the requested resources.

glamorous-carpet-83516

12/12/2022, 8:22 AM

sorry, could you describe the rayJob you are running?

future-notebook-79388

12/12/2022, 9:25 AM

Is there a command for this?

future-notebook-79388

12/13/2022, 4:33 AM

This is what is shown when we describe the kuberay-operator while the job is running.

glamorous-carpet-83516

12/13/2022, 8:45 PM

kubectl describe RayJobs <name> -n <namespace>
