# ray-integration
f
Hey! I’m following this guide to configure Ray. However, no Ray cluster is being created when I launch tasks. Anything else I should configure? Permissions, for example? Or do I need to install the API server when installing the kuberay operator?
Started a local Ray instance. View the dashboard at <http://127.0.0.1:8265>
k
This should be it. @Samhita Alla, do you know?
f
Using kuberay `0.6.0`
s
Are you trying to run Ray tasks on the demo cluster?
f
Not in this case. On a real EKS cluster.
s
Then the instructions specified in the deployment guide should suffice.
So no head or worker pods are spinning up?
f
Nope. This is how I’m running it, just in case:
# Imports assumed (not in the original paste): flytekit, the Ray plugin, and the Ray AIR/XGBoost pieces
from flytekit import Resources, task
from flytekitplugins.ray import HeadNodeConfig, RayJobConfig, WorkerNodeConfig
from ray.air.config import ScalingConfig
from ray.data.preprocessors import StandardScaler
from ray.train.xgboost import XGBoostTrainer

ray_config = RayJobConfig(
    head_node_config=HeadNodeConfig(ray_start_params={"log-color": "True"}),
    worker_node_config=[
        WorkerNodeConfig(
            group_name="ray-group",
            replicas=2,
        )
    ],
)
ray_resources = Resources(mem="8Gi", cpu="4")

@task(task_config=ray_config, limits=ray_resources)
def train_xgboost(num_workers: int, use_gpu: bool = False) -> dict:
    # prepare_data() comes from the surrounding example code and is not shown in this paste
    train_dataset, valid_dataset, _ = prepare_data()

    # Scale some random columns
    columns_to_scale = ["mean radius", "mean texture"]
    preprocessor = StandardScaler(columns=columns_to_scale)

    # XGBoost specific params
    params = {
        "tree_method": "approx",
        "objective": "binary:logistic",
        "eval_metric": ["logloss", "error"],
    }

    trainer = XGBoostTrainer(
        scaling_config=ScalingConfig(num_workers=num_workers, use_gpu=use_gpu),
        label_column="target",
        params=params,
        datasets={"train": train_dataset, "valid": valid_dataset},
        preprocessor=preprocessor,
        num_boost_round=100,
    )
    result = trainer.fit()
    print(result.metrics)
    return result.metrics
I’m checking:
• pods in the workflow’s namespace
• rayclusters in all namespaces
Can’t see anything. Yet the workflow runs okay.
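For reference, a minimal workflow wrapper for the task pasted above might look like the sketch below; `train_wf` and its default argument are illustrative (adapted from the Flyte Ray plugin example), not the poster's actual workflow. When registered and run remotely, the Ray plugin is expected to submit a RayJob rather than start a local Ray instance.

from flytekit import workflow

@workflow
def train_wf(num_workers: int = 2) -> dict:
    # Hypothetical wrapper: calls the Ray-backed task defined above
    return train_xgboost(num_workers=num_workers)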
s
Do you see a kuberay-operator pod?
f
In the `flyte` namespace. Not in the workflow’s namespace.
s
Yes, can you check its logs?
f
Nothing strange. Only a readiness probe failure, but it passed and I checked the pod description; it’s ready:
2023-09-14T13:46:32.489Z	INFO	controllers.RayCluster	no ray node pod found for event	{"event": "&Event{ObjectMeta:{kuberay-operator-6b68b5b49d-tnnxx.1784c7eb4fd2b844  flyte  07ecf1a3-2327-4f4f-96c5-5726c28249fd 305718066 0 2023-09-14 13:46:13 +0000 UTC <nil> <nil> map[] map[] [] []  [{kubelet Update v1 2023-09-14 13:46:13 +0000 UTC FieldsV1 {\"f:count\":{},\"f:firstTimestamp\":{},\"f:involvedObject\":{},\"f:lastTimestamp\":{},\"f:message\":{},\"f:reason\":{},\"f:source\":{\"f:component\":{},\"f:host\":{}},\"f:type\":{}} }]},InvolvedObject:ObjectReference{Kind:Pod,Namespace:flyte,Name:kuberay-operator-6b68b5b49d-tnnxx,UID:720dc81f-a958-493c-b061-3ac971556e15,APIVersion:v1,ResourceVersion:305193883,FieldPath:spec.containers{kuberay-operator},},Reason:Unhealthy,Message:Readiness probe failed: Get \"<http://10.194.52.36:8080/metrics>\": dial tcp 10.194.52.36:8080: connect: connection refused,Source:EventSource{Component:kubelet,Host:ip-10-194-2-132.eu-west-1.compute.internal,},FirstTimestamp:2023-09-14 13:46:13 +0000 UTC,LastTimestamp:2023-09-14 13:46:18 +0000 UTC,Count:2,Type:Warning,EventTime:0001-01-01 00:00:00 +0000 UTC,Series:nil,Action:,Related:nil,ReportingController:,ReportingInstance:,}"}
2023-09-14T13:46:32.489Z	INFO	controller.rayjob	Starting workers	{"reconciler group": "<http://ray.io|ray.io>", "reconciler kind": "RayJob", "worker count": 1}
Anything related to RBAC I should check? We’re installing Flyte using kustomize and not the Helm chart, so maybe there’s something I should add/change.
s
Oh, I'm not sure. @Kevin Su, do you have any idea?
y
For Kuberay, the default Helm chart creates the ClusterRole and ClusterRoleBinding, allowing it to watch custom resources in all namespaces. To debug, can you see the RayJob CR (`kubectl get RayJob -A`)?
k
Yes, could you describe the RayJob? (`kubectl describe Rayjob <name>`)
f
So: I reinstalled everything from scratch, and Ray cluster and worker pods are being launched. There were two things I had to sort out (not a problem with the plugin, sharing this in case there is something we should review):
• Version 0.6.0 has a problem when running RayJobs. The jobs are created, but K8s Jobs are also launched, and they don’t start because they don’t get `resource.limits` defined. I installed `0.5.2` and it seems better now.
• Is this correct? Getting an OOM error with one workflow (another one works), and taking a look I found this
k
Is this correct? Getting an OOM error with one workflow
those envs are defined by ray
m
Hi, I have a very similar (if not identical) case:
• Had kuberay-operator 0.6.0 installed, downgraded to 0.5.2.
• No Ray-related pods are created, and no RayCluster/RayJob resources are created in any namespace when the Flyte task is running.
• The Flyte workflow passes without errors.
There is nothing logged in the kuberay-operator logs when the workflow is running. In the Flyte job's logs, I have:
2023-10-21 10:06:45,617	WARNING services.py:1889 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 67108864 bytes available. This will harm performance! You may be able to free up space by deleting files in /dev/shm. If you are inside a Docker container, you can increase /dev/shm size by passing '--shm-size=0.22gb' to 'docker run' (or add it to the run_options list in a Ray cluster config). Make sure to set this to more than 30% of available RAM.
2023-10-21 10:06:46,702	INFO worker.py:1633 -- Started a local Ray instance. View the dashboard at <http://127.0.0.1:8265>
I would appreciate any pointers on what to check. Kuberay is deployed via Helm, and so is Flyte. I am running on a microk8s cluster.
BTW, I can create RayClusters manually.
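For anyone hitting the same symptom, a stripped-down Ray task along these lines (a sketch adapted from the Flyte Ray plugin example; the function names, group name, and replica count are illustrative) makes it easy to tell whether the plugin is wired up: run it remotely, then check `kubectl get raycluster -A`. A `127.0.0.1:8265` dashboard URL in the task pod logs means Ray fell back to a local instance.

import typing

import ray
from flytekit import task
from flytekitplugins.ray import HeadNodeConfig, RayJobConfig, WorkerNodeConfig

@ray.remote
def square(x: int) -> int:
    return x * x

@task(
    task_config=RayJobConfig(
        head_node_config=HeadNodeConfig(),
        worker_node_config=[WorkerNodeConfig(group_name="smoke-test", replicas=1)],
    )
)
def ray_smoke_test(n: int = 5) -> typing.List[int]:
    # If the plugin created a RayCluster, this fans out to the remote workers;
    # otherwise flytekit starts a local Ray instance inside the task pod.
    return ray.get([square.remote(i) for i in range(n)])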
y
Kuberay already adds a /dev/shm volumeMount for the object store to avoid performance degradation. See here: https://github.com/ray-project/kuberay/blob/master/ray-operator/controllers/ray/common/pod.go#L272. It can use as much memory as is available on the node. So, the reason might be that you are running on a microk8s cluster, which only has 64MB available for /dev/shm.
s
@Maciej Kopczyński, how have you installed the plugin? Could you let us know the commands you ran? cc @Kevin Su
m
I went through https://docs.flyte.org/projects/cookbook/en/stable/auto_examples/ray_plugin/index.html#install-the-plugin and https://docs.flyte.org/en/latest/deployment/plugins/k8s/index.html#deployment-plugin-setup-k8s. I am using the `flyte-binary` version installed using Helm. `kuberay` is also installed using Helm (I had been using it before deploying Flyte). I think the `/dev/shm` warning was related to the fact that no `RayCluster`s were started; tasks were running in the flytekit container. Naturally, I have more than 64MB of memory available per node (16GB). But @Yicheng Lu’s reply indirectly helped me. When I visited the `kuberay` repo, I found out that version `v1.0.0-rc.1` is available. I gave it a try and... it just worked. It bothers me a bit, because I have no idea why it did not work on either `0.5.2` or `0.6.0`, and I would love to know what to check (I did not find any logs, K8s events, etc.). But at least it is working now, so I can continue with my PoC. If you have any advice regarding debugging steps, I will gladly try it though!
s
A detailed explanation is included in this PR description. Would you mind leaving a comment on the issue saying 1.0 works as is?
m