# ray-integration
f
Hey! I’m following this guide to configure Ray. However, no Ray cluster is being created when I launch tasks. Anything else I should configure? Permissions, for example? Or do I need to install the API server when installing the kuberay operator?
Started a local Ray instance. View the dashboard at <http://127.0.0.1:8265>
k
This should be it. @Samhita Alla, do you know?
f
Using kuberay `0.6.0`
s
Are you trying to run Ray tasks on the demo cluster?
f
Not in this case. On a real EKS cluster.
s
Then the instructions specified in the deployment guide should suffice.
So no head or worker pods are spinning up?
f
Nope. This is how I’m running it, just in case:
# Imports assumed (not in the original paste): flytekit, the Ray plugin, and the Ray AIR/XGBoost pieces
from flytekit import Resources, task
from flytekitplugins.ray import HeadNodeConfig, RayJobConfig, WorkerNodeConfig
from ray.air.config import ScalingConfig
from ray.data.preprocessors import StandardScaler
from ray.train.xgboost import XGBoostTrainer

ray_config = RayJobConfig(
    head_node_config=HeadNodeConfig(ray_start_params={"log-color": "True"}),
    worker_node_config=[
        WorkerNodeConfig(
            group_name="ray-group",
            replicas=2,
        )
    ],
)
ray_resources = Resources(mem="8Gi", cpu="4")

@task(task_config=ray_config, limits=ray_resources)
def train_xgboost(num_workers: int, use_gpu: bool = False) -> dict:
    # prepare_data() comes from the surrounding example code and is not shown in this paste
    train_dataset, valid_dataset, _ = prepare_data()

    # Scale some random columns
    columns_to_scale = ["mean radius", "mean texture"]
    preprocessor = StandardScaler(columns=columns_to_scale)

    # XGBoost specific params
    params = {
        "tree_method": "approx",
        "objective": "binary:logistic",
        "eval_metric": ["logloss", "error"],
    }

    trainer = XGBoostTrainer(
        scaling_config=ScalingConfig(num_workers=num_workers, use_gpu=use_gpu),
        label_column="target",
        params=params,
        datasets={"train": train_dataset, "valid": valid_dataset},
        preprocessor=preprocessor,
        num_boost_round=100,
    )
    result = trainer.fit()
    print(result.metrics)
    return result.metrics
I’m checking:
• pods in the workflow’s namespace
• rayclusters in all namespaces
Can’t see anything. Yet the workflow runs okay.
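For reference, a minimal workflow wrapper for the task pasted above might look like the sketch below; `train_wf` and its default argument are illustrative (adapted from the Flyte Ray plugin example), not the poster's actual workflow. When registered and run remotely, the Ray plugin is expected to submit a RayJob rather than start a local Ray instance.

from flytekit import workflow

@workflow
def train_wf(num_workers: int = 2) -> dict:
    # Hypothetical wrapper: calls the Ray-backed task defined above
    return train_xgboost(num_workers=num_workers)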
s
Do you see a kuberay-operator pod?
f
In the `flyte` namespace. Not in the workflow’s namespace.
s
Yes, can you check its logs?
f
Nothing strange. Only a readiness probe failure, but it passed and I checked the pod description; it’s ready:
2023-09-14T13:46:32.489Z	INFO	controllers.RayCluster	no ray node pod found for event	{"event": "&Event{ObjectMeta:{kuberay-operator-6b68b5b49d-tnnxx.1784c7eb4fd2b844  flyte  07ecf1a3-2327-4f4f-96c5-5726c28249fd 305718066 0 2023-09-14 13:46:13 +0000 UTC <nil> <nil> map[] map[] [] []  [{kubelet Update v1 2023-09-14 13:46:13 +0000 UTC FieldsV1 {\"f:count\":{},\"f:firstTimestamp\":{},\"f:involvedObject\":{},\"f:lastTimestamp\":{},\"f:message\":{},\"f:reason\":{},\"f:source\":{\"f:component\":{},\"f:host\":{}},\"f:type\":{}} }]},InvolvedObject:ObjectReference{Kind:Pod,Namespace:flyte,Name:kuberay-operator-6b68b5b49d-tnnxx,UID:720dc81f-a958-493c-b061-3ac971556e15,APIVersion:v1,ResourceVersion:305193883,FieldPath:spec.containers{kuberay-operator},},Reason:Unhealthy,Message:Readiness probe failed: Get \"<http://10.194.52.36:8080/metrics>\": dial tcp 10.194.52.36:8080: connect: connection refused,Source:EventSource{Component:kubelet,Host:ip-10-194-2-132.eu-west-1.compute.internal,},FirstTimestamp:2023-09-14 13:46:13 +0000 UTC,LastTimestamp:2023-09-14 13:46:18 +0000 UTC,Count:2,Type:Warning,EventTime:0001-01-01 00:00:00 +0000 UTC,Series:nil,Action:,Related:nil,ReportingController:,ReportingInstance:,}"}
2023-09-14T13:46:32.489Z	INFO	controller.rayjob	Starting workers	{"reconciler group": "<http://ray.io|ray.io>", "reconciler kind": "RayJob", "worker count": 1}
Anything related to RBAC I should check? We’re installing Flyte using kustomize and not the Helm chart, so maybe there’s something I should add/change.
s
Oh, I'm not sure. @Kevin Su, do you have any idea?
y
For Kuberay, the default Helm chart creates the ClusterRole and ClusterRoleBinding, allowing it to watch custom resources in all namespaces. To debug, can you see the RayJob CR (`kubectl get RayJob -A`)?
k
Yes, could you describe the RayJob? (`kubectl describe Rayjob <name>`)
f
So: I reinstalled everything from scratch, and Ray cluster and worker pods are being launched. There were two things I had to sort out (not a problem with the plugin, sharing this in case there is something we should review):
• Version 0.6.0 has a problem when running RayJobs. The jobs are created, but K8s Jobs are also launched, and they don’t start because they don’t get `resource.limits` defined. I installed `0.5.2` and it seems better now.
• Is this correct? Getting an OOM error with one workflow (another one works), and taking a look I found this
k
Is this correct? Getting an OOM error with one workflow
those envs are defined by ray
m
Hi, I have a very similar (if not identical) case:
• Had kuberay-operator 0.6.0 installed, downgraded to 0.5.2.
• No Ray-related pods are created, and no RayCluster/RayJob resources are created in any namespace when the Flyte task is running.
• The Flyte workflow passes without errors.
There is nothing logged in the kuberay-operator logs when the workflow is running. In the Flyte job's logs, I have:
2023-10-21 10:06:45,617	WARNING services.py:1889 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 67108864 bytes available. This will harm performance! You may be able to free up space by deleting files in /dev/shm. If you are inside a Docker container, you can increase /dev/shm size by passing '--shm-size=0.22gb' to 'docker run' (or add it to the run_options list in a Ray cluster config). Make sure to set this to more than 30% of available RAM.
2023-10-21 10:06:46,702	INFO worker.py:1633 -- Started a local Ray instance. View the dashboard at <http://127.0.0.1:8265>
I would appreciate any pointers on what to check. Kuberay is deployed via Helm, and so is Flyte. I am running on a microk8s cluster.
BTW, I can create RayClusters manually.
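For anyone hitting the same symptom, a stripped-down Ray task along these lines (a sketch adapted from the Flyte Ray plugin example; the function names, group name, and replica count are illustrative) makes it easy to tell whether the plugin is wired up: run it remotely, then check `kubectl get raycluster -A`. A `127.0.0.1:8265` dashboard URL in the task pod logs means Ray fell back to a local instance.

import typing

import ray
from flytekit import task
from flytekitplugins.ray import HeadNodeConfig, RayJobConfig, WorkerNodeConfig

@ray.remote
def square(x: int) -> int:
    return x * x

@task(
    task_config=RayJobConfig(
        head_node_config=HeadNodeConfig(),
        worker_node_config=[WorkerNodeConfig(group_name="smoke-test", replicas=1)],
    )
)
def ray_smoke_test(n: int = 5) -> typing.List[int]:
    # If the plugin created a RayCluster, this fans out to the remote workers;
    # otherwise flytekit starts a local Ray instance inside the task pod.
    return ray.get([square.remote(i) for i in range(n)])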
y
Kuberay already adds a /dev/shm volumeMount for the object store to avoid performance degradation. See here: https://github.com/ray-project/kuberay/blob/master/ray-operator/controllers/ray/common/pod.go#L272. It can use as much memory as is available on the node. So, the reason might be that you are running on a microk8s cluster, which only has 64MB available for /dev/shm.
s
@Maciej Kopczyński, how have you installed the plugin? Could you let us know the commands you ran? cc @Kevin Su
m
I went through https://docs.flyte.org/projects/cookbook/en/stable/auto_examples/ray_plugin/index.html#install-the-plugin and https://docs.flyte.org/en/latest/deployment/plugins/k8s/index.html#deployment-plugin-setup-k8s. I am using the `flyte-binary` version installed using Helm. `kuberay` is also installed using Helm (I had been using it before deploying Flyte). I think the `/dev/shm` warning was related to the fact that no `RayCluster`s were started; tasks were running in the flytekit container. Naturally, I have more than 64MB of memory available per node (16GB). But @Yicheng Lu’s reply indirectly helped me. When I visited the `kuberay` repo, I found out that version `v1.0.0-rc.1` is available. I gave it a try and... it just worked. It bothers me a bit, because I have no idea why it did not work on either `0.5.2` or `0.6.0`, and I would love to know what to check (I did not find any logs, K8s events, etc.). But at least it is working now, so I can continue with my PoC. If you have any advice regarding debugging steps, I will gladly try it though!
s
A detailed explanation is included in this PR description. Would you mind leaving a comment on the issue saying 1.0 works as is?
m