Franco Bocci — 09/14/2023, 9:59 AM
Started a local Ray instance. View the dashboard at http://127.0.0.1:8265
Ketan (kumare3)
Franco Bocci — 09/14/2023, 1:25 PM
Samhita Alla

Franco Bocci — 09/14/2023, 1:26 PM
Samhita Alla

Franco Bocci
09/14/2023, 1:39 PM
from flytekit import Resources, task
from flytekitplugins.ray import HeadNodeConfig, RayJobConfig, WorkerNodeConfig
from ray.air.config import ScalingConfig
from ray.data.preprocessors import StandardScaler
from ray.train.xgboost import XGBoostTrainer

ray_config = RayJobConfig(
    head_node_config=HeadNodeConfig(ray_start_params={"log-color": "True"}),
    worker_node_config=[
        WorkerNodeConfig(
            group_name="ray-group",
            replicas=2,
        )
    ],
)

ray_resources = Resources(mem="8Gi", cpu="4")


@task(task_config=ray_config, limits=ray_resources)
def train_xgboost(num_workers: int, use_gpu: bool = False) -> dict:
    # prepare_data() is defined elsewhere in this module.
    train_dataset, valid_dataset, _ = prepare_data()

    # Scale some random columns
    columns_to_scale = ["mean radius", "mean texture"]
    preprocessor = StandardScaler(columns=columns_to_scale)

    # XGBoost-specific params
    params = {
        "tree_method": "approx",
        "objective": "binary:logistic",
        "eval_metric": ["logloss", "error"],
    }

    trainer = XGBoostTrainer(
        scaling_config=ScalingConfig(num_workers=num_workers, use_gpu=use_gpu),
        label_column="target",
        params=params,
        datasets={"train": train_dataset, "valid": valid_dataset},
        preprocessor=preprocessor,
        num_boost_round=100,
    )
    result = trainer.fit()
    print(result.metrics)
    return result.metrics
Samhita Alla

Franco Bocci — 09/14/2023, 1:44 PM
flyte namespace. Not in the workflow’s namespace.

Samhita Alla
Franco Bocci — 09/14/2023, 1:50 PM
2023-09-14T13:46:32.489Z INFO controllers.RayCluster no ray node pod found for event {"event": "&Event{ObjectMeta:{kuberay-operator-6b68b5b49d-tnnxx.1784c7eb4fd2b844 flyte 07ecf1a3-2327-4f4f-96c5-5726c28249fd 305718066 0 2023-09-14 13:46:13 +0000 UTC <nil> <nil> map[] map[] [] [] [{kubelet Update v1 2023-09-14 13:46:13 +0000 UTC FieldsV1 {\"f:count\":{},\"f:firstTimestamp\":{},\"f:involvedObject\":{},\"f:lastTimestamp\":{},\"f:message\":{},\"f:reason\":{},\"f:source\":{\"f:component\":{},\"f:host\":{}},\"f:type\":{}} }]},InvolvedObject:ObjectReference{Kind:Pod,Namespace:flyte,Name:kuberay-operator-6b68b5b49d-tnnxx,UID:720dc81f-a958-493c-b061-3ac971556e15,APIVersion:v1,ResourceVersion:305193883,FieldPath:spec.containers{kuberay-operator},},Reason:Unhealthy,Message:Readiness probe failed: Get \"http://10.194.52.36:8080/metrics\": dial tcp 10.194.52.36:8080: connect: connection refused,Source:EventSource{Component:kubelet,Host:ip-10-194-2-132.eu-west-1.compute.internal,},FirstTimestamp:2023-09-14 13:46:13 +0000 UTC,LastTimestamp:2023-09-14 13:46:18 +0000 UTC,Count:2,Type:Warning,EventTime:0001-01-01 00:00:00 +0000 UTC,Series:nil,Action:,Related:nil,ReportingController:,ReportingInstance:,}"}
2023-09-14T13:46:32.489Z INFO controller.rayjob Starting workers {"reconciler group": "ray.io", "reconciler kind": "RayJob", "worker count": 1}

Samhita Alla
Yicheng Lu — 09/14/2023, 3:37 PM
(kubectl get RayJob -A)?

Kevin Su — 09/14/2023, 4:48 PM
(kubectl describe RayJob <name>)

Franco Bocci
09/15/2023, 7:46 AM
resource.limits defined. I installed 0.5.2 and it seems better now.
• Is this correct? Getting an OOM error with one workflow (another one works), and taking a look found this.

Kevin Su — 09/15/2023, 5:14 PM
> Is this correct? Getting an OOM error with one workflow
Those envs are defined by Ray.
Maciej Kopczyński — 10/21/2023, 10:08 AM
2023-10-21 10:06:45,617 WARNING services.py:1889 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 67108864 bytes available. This will harm performance! You may be able to free up space by deleting files in /dev/shm. If you are inside a Docker container, you can increase /dev/shm size by passing '--shm-size=0.22gb' to 'docker run' (or add it to the run_options list in a Ray cluster config). Make sure to set this to more than 30% of available RAM.
2023-10-21 10:06:46,702 INFO worker.py:1633 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265
I would appreciate any pointers on what to check. Kuberay is deployed via Helm, and so is Flyte. I am running on a microk8s cluster.

Yicheng Lu — 10/21/2023, 7:20 PM
Samhita Alla
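As background on the warning itself, the numbers in that log line are easy to check: /dev/shm was only 67108864 bytes (64 MiB), and Ray asks for /dev/shm to be sized to more than ~30% of available RAM. A minimal sketch of that arithmetic (illustrative only, not Ray's actual sizing code; the 16 GB figure comes from the node size mentioned later in the thread):

```python
# Illustrative arithmetic behind Ray's /dev/shm warning (not Ray source code).
# The warning fires because /dev/shm is too small for the object store;
# Ray suggests sizing /dev/shm to more than ~30% of available RAM,
# e.g. via `docker run --shm-size=...` as the warning text says.

def suggested_shm_gb(available_ram_gb: float, fraction: float = 0.3) -> float:
    """Size /dev/shm to `fraction` of available RAM, as the warning advises."""
    return available_ram_gb * fraction

shm_bytes = 67108864
assert shm_bytes == 64 * 2**20        # the reported /dev/shm is only 64 MiB
print(suggested_shm_gb(16))           # 16 GB node -> 4.8 (GB)
```

(As it turns out below, the real issue here was that no RayCluster was started at all, so the task ran in the flytekit container with the default tiny /dev/shm.)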
Maciej Kopczyński — 10/24/2023, 1:25 PM
flyte-binary version installed using Helm. kuberay is also installed using Helm (I had been using it before deploying Flyte). I think the /dev/shm warning was related to the fact that no `RayCluster`s were started; tasks were running in the flytekit container. Naturally, I have more than 64 MB of memory available per node (16 GB). But @Yicheng Lu’s reply indirectly helped me: when I visited the kuberay repo, I found that version v1.0.0-rc.1 is available. I gave it a try and... it just worked. It bothers me a bit, because I have no idea why it worked on neither 0.5.2 nor 0.6.0, and I would love to know what to check (I did not find any logs, k8s events, etc.). But at least it is working now, so I can continue with my PoC. If you have any advice on debugging steps, I will gladly try it!

Samhita Alla
Maciej Kopczyński — 10/26/2023, 3:38 PM