Kamakshi Muthukrishnan
11/07/2022, 2:21 PM
Samhita Alla
Kamakshi Muthukrishnan
11/08/2022, 5:17 AM
Samhita Alla
Samhita Alla
Ketan (kumare3)
Ketan (kumare3)
Ketan (kumare3)
Ketan (kumare3)
Kamakshi Muthukrishnan
11/08/2022, 5:36 AM
Kamakshi Muthukrishnan
11/08/2022, 5:59 AM
Kamakshi Muthukrishnan
11/08/2022, 6:00 AM
@task(task_config=RayJobConfig(...), requests=Resources(cpu="2"), limits=Resources(gpu="2"))
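For reference, a minimal sketch of how such a task can be spelled out end to end; the head/worker settings, task name, and resource values below are illustrative assumptions, not the elided configuration above.

from flytekit import Resources, task
from flytekitplugins.ray import HeadNodeConfig, RayJobConfig, WorkerNodeConfig

# Illustrative Ray cluster shape (assumed values).
ray_config = RayJobConfig(
    head_node_config=HeadNodeConfig(),
    worker_node_config=[WorkerNodeConfig(group_name="ray-group", replicas=2)],
)

@task(
    task_config=ray_config,
    requests=Resources(cpu="2"),  # what the pods ask the Kubernetes scheduler for
    limits=Resources(gpu="2"),    # hard cap enforced by Kubernetes
)
def heavy_ray_task() -> None:
    # Hypothetical task body; Ray work would go here.
    ...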
Kevin Su
11/08/2022, 5:59 PM
task_resource_defaults.yaml: |
  task_resources:
    defaults:
      cpu: 400m
      memory: 500Mi
      storage: 500Mi
    limits:
      cpu: 2
      gpu: 1
      memory: 4Gi
      storage: 20Mi
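If a task does not declare its own resources, it falls back to the defaults above, while explicit requests/limits on the decorator override them per task (subject to the platform limits). A small sketch with illustrative values and hypothetical task names:

from flytekit import Resources, task

@task
def uses_platform_defaults() -> None:
    # No resources declared: picks up task_resources.defaults (400m CPU, 500Mi memory).
    ...

@task(requests=Resources(cpu="1", mem="1Gi"), limits=Resources(cpu="2", mem="2Gi"))
def overrides_defaults() -> None:
    # Explicit values take precedence over the platform defaults.
    ...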
Kevin Su
11/08/2022, 6:03 PM
cluster_resources:
  customData:
    - production:
        - projectQuotaCpu:
            value: "5"
        - projectQuotaMemory:
            value: 4000Mi
    - staging:
        - projectQuotaCpu:
            value: "2"
        - projectQuotaMemory:
            value: 3000Mi
    - development:
        - projectQuotaCpu:
            value: "12"
        - projectQuotaMemory:
            value: 8000Mi
karthikraj
11/10/2022, 2:32 AM
Padma Priya M
11/10/2022, 4:21 AM
import typing
import ray
from ray import tune
from flytekit import Resources, task, workflow
from flytekitplugins.ray import HeadNodeConfig, RayJobConfig, WorkerNodeConfig


# A Tune trainable is a plain function that returns (or reports) a dict of metrics.
def objective(config):
    return {"score": config["x"] * config["x"]}


ray_config = RayJobConfig(
    head_node_config=HeadNodeConfig(ray_start_params={"log-color": "True"}),
    worker_node_config=[WorkerNodeConfig(group_name="ray-group", replicas=2)],
    runtime_env={"pip": ["numpy", "pandas"]},
)


@task(task_config=ray_config, limits=Resources(mem="2000Mi", cpu="1"))
def ray_task(n: int) -> int:
    model_params = {"x": tune.randint(-10, 10)}
    tuner = tune.Tuner(
        objective,
        tune_config=tune.TuneConfig(
            metric="score",
            mode="min",
            num_samples=10,
            max_concurrent_trials=n,
        ),
        param_space=model_params,
    )
    results = tuner.fit()
    # Tuner.fit() returns a ResultGrid, not an int; return the best "x" found.
    return results.get_best_result().config["x"]


@workflow
def ray_workflow(n: int) -> int:
    return ray_task(n=n)

Are there any other ways to run hyperparameter tuning in a distributed manner, like Ray Tune?
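A Flyte-native pattern that is sometimes used for this, sketched here with hypothetical names: fan the candidates out with map_task so each trial runs as its own parallel task pod.

import typing
from flytekit import map_task, task, workflow

@task
def evaluate(x: int) -> int:
    # Hypothetical objective: score a single hyperparameter candidate.
    return x * x

@workflow
def grid_search(candidates: typing.List[int]) -> typing.List[int]:
    # Each candidate becomes an independent Flyte task, scheduled in parallel.
    return map_task(evaluate)(x=candidates)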
Kamakshi Muthukrishnan
11/15/2022, 1:28 PM
karthikraj
11/17/2022, 5:30 AM
Placement group creation timed out. Make sure your cluster either has enough resources or use an autoscaling cluster. Current resources available: {'CPU': 1.0, 'object_store_memory': 217143276.0, 'memory': 434412750.0, 'node:10.69.53.118': 0.98}, resources requested by the placement group: [{'CPU': 1.0}, {'CPU': 1.0}]
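The error above says the trials asked for two 1-CPU bundles while only one CPU was available to Ray. One way this is commonly addressed, sketched with assumed values, is to give the Ray cluster more CPU via larger per-pod requests and/or more worker replicas; this sketch assumes the task-level requests/limits are what size the Ray head and worker pods.

from flytekit import Resources, task
from flytekitplugins.ray import HeadNodeConfig, RayJobConfig, WorkerNodeConfig

# Assumed values: enough workers/CPUs so both 1-CPU bundles can be placed.
ray_config = RayJobConfig(
    head_node_config=HeadNodeConfig(),
    worker_node_config=[WorkerNodeConfig(group_name="ray-group", replicas=2)],
)

@task(
    task_config=ray_config,
    requests=Resources(cpu="2", mem="2000Mi"),
    limits=Resources(cpu="2", mem="2000Mi"),
)
def tuning_task() -> None:
    # Hypothetical tuning body.
    ...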
Dylan Wilder
11/18/2022, 9:36 PM
Padma Priya M
11/21/2022, 5:18 AM
Padma Priya M
11/29/2022, 2:57 PM
from flytekit import Resources, task, workflow
from flytekitplugins.ray import HeadNodeConfig, RayJobConfig, WorkerNodeConfig
import ray
from ray import tune
from sklearn.datasets import load_breast_cancer
from xgboost_ray import RayDMatrix, RayParams, train

# ray.init()
# ray.init("auto", ignore_reinit_error=True)

ray_config = RayJobConfig(
    head_node_config=HeadNodeConfig(ray_start_params={"log-color": "True"}),
    worker_node_config=[WorkerNodeConfig(group_name="ray-group", replicas=2)],
)

num_actors = 4
num_cpus_per_actor = 1
ray_params = RayParams(
    num_actors=num_actors, cpus_per_actor=num_cpus_per_actor)


def train_model(config):
    train_x, train_y = load_breast_cancer(return_X_y=True)
    train_set = RayDMatrix(train_x, train_y)
    evals_result = {}
    bst = train(
        params=config,
        dtrain=train_set,
        evals_result=evals_result,
        evals=[(train_set, "train")],
        verbose_eval=False,
        ray_params=ray_params)
    bst.save_model("model.xgb")


@task(task_config=ray_config, limits=Resources(mem="2000Mi", cpu="1"))
def train_model_task() -> dict:
    config = {
        "tree_method": "approx",
        "objective": "binary:logistic",
        "eval_metric": ["logloss", "error"],
        "eta": tune.loguniform(1e-4, 1e-1),
        "subsample": tune.uniform(0.5, 1.0),
        "max_depth": tune.randint(1, 9),
    }
    analysis = tune.run(
        train_model,
        config=config,
        metric="train-error",
        mode="min",
        num_samples=4,
        resources_per_trial=ray_params.get_tune_resources())
    return analysis.best_config


@workflow
def train_model_wf() -> dict:
    return train_model_task()
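A small usage note, assuming Ray, xgboost_ray, and scikit-learn are installed locally and the machine has enough CPUs for the configured actors: calling the workflow as plain Python runs it in-process, which is a quick sanity check before registering it.

# Hypothetical local smoke test; Flyte executes @task/@workflow locally when called directly.
if __name__ == "__main__":
    best = train_model_wf()
    print("best hyperparameters:", best)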
Padma Priya M
12/08/2022, 3:11 PM
Hiromu Hota
12/20/2022, 9:21 PM
Is there a way to configure ttlSecondsAfterFinished? By default, it is 3600s (1 hour) and we’d like to tear down the cluster right after a job is complete. Thanks for your help!
$ k describe rayjobs feb5da8c2a2394fb4ac8-n0-0 -n flytesnacks-development
...
Ttl Seconds After Finished: 3600
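Newer releases of flytekitplugins-ray than the ones discussed in this thread expose shutdown/TTL settings directly on RayJobConfig; treat the fields below as an assumption to verify against the installed plugin version. A hedged sketch:

from flytekit import task
from flytekitplugins.ray import HeadNodeConfig, RayJobConfig, WorkerNodeConfig

# Assumes a plugin version where these fields exist; otherwise the backend default
# of 3600 seconds shown above applies.
ray_config = RayJobConfig(
    head_node_config=HeadNodeConfig(),
    worker_node_config=[WorkerNodeConfig(group_name="ray-group", replicas=2)],
    shutdown_after_job_finishes=True,   # tear the RayCluster down when the job completes
    ttl_seconds_after_finished=60,      # keep it around briefly for log inspection
)

@task(task_config=ray_config)
def ray_job() -> None:
    ...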
Kevin Su
12/21/2022, 8:50 AM
Padma Priya M
12/22/2022, 1:50 PM
pyflyte register --image <image name> <file name> --version <version number>
I used the image built with the required dependencies, and the KubeRay version was 0.3.0.
Padma Priya M
12/22/2022, 1:52 PM
Padma Priya M
01/13/2023, 4:33 AM
When I run pyflyte --config ~/.flyte/config-remote.yaml run --remote --image <image_name> ray_demo.py wf, I am getting this issue in the logs and the task stays queued in the console. When the same workflow is executed locally using pyflyte --config ~/.flyte/config-remote.yaml run --image <image_name> ray_demo.py wf, it works fine.
Padma Priya M
01/16/2023, 5:48 AM
Ruksana Kabealo
01/30/2023, 8:42 PM
Marcin Zieminski
02/23/2023, 8:55 PM
2023-02-23T18:08:48.386Z INFO controllers.RayJob RayJob associated rayCluster found {"rayjob": "a8xqtvnds2sp7bgkn96k-fzvfpg5y-0", "raycluster": "ntropy-development/a8xqtvnds2sp7bgkn96k-fzvfpg5y-0-raycluster-xlg4f"}
2023-02-23T18:08:48.387Z INFO controllers.RayJob waiting for the cluster to be ready {"rayCluster": "a8xqtvnds2sp7bgkn96k-fzvfpg5y-0-raycluster-xlg4f"}
2023-02-23T18:08:51.387Z INFO controllers.RayJob reconciling RayJob {"NamespacedName": "ntropy-development/a8xqtvnds2sp7bgkn96k-fzvfpg5y-0"}
2023-02-23T18:08:51.388Z INFO controllers.RayJob RayJob associated rayCluster found {"rayjob": "a8xqtvnds2sp7bgkn96k-fzvfpg5y-0", "raycluster": "ntropy-development/a8xqtvnds2sp7bgkn96k-fzvfpg5y-0-raycluster-xlg4f"}
2023-02-23T18:08:51.388Z INFO controllers.RayJob waiting for the cluster to be ready {"rayCluster": "a8xqtvnds2sp7bgkn96k-fzvfpg5y-0-raycluster-xlg4f"}
2023-02-23T18:08:54.388Z INFO controllers.RayJob reconciling RayJob {"NamespacedName": "ntropy-development/a8xqtvnds2sp7bgkn96k-fzvfpg5y-0"}
2023-02-23T18:08:54.388Z INFO controllers.RayJob RayJob associated rayCluster found {"rayjob": "a8xqtvnds2sp7bgkn96k-fzvfpg5y-0", "raycluster": "ntropy-development/a8xqtvnds2sp7bgkn96k-fzvfpg5y-0-raycluster-xlg4f"}
2023-02-23T18:08:54.389Z INFO controllers.RayJob waiting for the cluster to be ready {"rayCluster": "a8xqtvnds2sp7bgkn96k-fzvfpg5y-0-raycluster-xlg4f"}
2023-02-23T18:08:57.389Z INFO controllers.RayJob reconciling RayJob {"NamespacedName": "ntropy-development/a8xqtvnds2sp7bgkn96k-fzvfpg5y-0"}
These logs seem to be generated by this piece of code:
https://github.com/ray-project/kuberay/blob/89f5fba8d6f868f9fedde1fbe22a6eccad88ecc1/ray-operator/controllers/ray/rayjob_controller.go#L174
and they are unexpected, since the cluster is healthy and I can use it on the side.
I would appreciate any help and advice. Do you think it could be the operator version?
My Flyte deployment is version 1.2.1.
Ray in the cluster is 2.2.0.
flytekitplugins-ray: 1.2.7