# flyte-support
d
Hello, I followed the Ray k8s guide and ran the example task. It has been queuing indefinitely for a day, and I can't find any logs telling me what the issue is. Does anyone know what the problem is, or how to track it down?
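For context, the task I ran is essentially the example from the guide; roughly like this (a sketch with illustrative names and resource values, not my exact code):
Copy code
import typing

import ray
from flytekit import Resources, task, workflow
from flytekitplugins.ray import HeadNodeConfig, RayJobConfig, WorkerNodeConfig


@ray.remote
def square(x: int) -> int:
    # Simple remote function so the Ray cluster has something to run.
    return x * x


ray_config = RayJobConfig(
    head_node_config=HeadNodeConfig(ray_start_params={"dashboard-host": "0.0.0.0"}),
    worker_node_config=[WorkerNodeConfig(group_name="ray-group", replicas=2)],
)


@task(task_config=ray_config, requests=Resources(cpu="2", mem="4Gi"))
def ray_task(n: int) -> typing.List[int]:
    # Flyte brings up the Ray cluster for this task and runs the job on it.
    futures = [square.remote(i) for i in range(n)]
    return ray.get(futures)


@workflow
def ray_workflow(n: int) -> typing.List[int]:
    return ray_task(n=n)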
g
It may be because you don't have enough resources in your cluster. Could you run
kubectl get pods -n flytesnacks-development
to see if the Ray worker pods are stuck in Pending?
If so, you may need to increase your cluster CPU/memory, or reduce the worker CPUs.
d
I definitely have enough resources available. I lowered the CPU and memory for the task and created larger node instances just to make sure.
g
Could you share the KubeRay operator logs, and the output of
kubectl describe raycluster <ray_cluster_name> -n flytesnacks-development
d
I don't have a Ray cluster. I thought Flyte automatically creates the Ray cluster as needed?
g
Yes, Flyte creates a Ray cluster in k8s. It's a custom resource, so you can also get logs from it. You should see some pods created for the Ray cluster:
Copy code
-> kubectl get pods -n flytesnacks-development
ray-head-...
ray-worker...
d
There are no pods. I do see the CRD from kubectl, though.
g
Could you share the logs of the KubeRay operator in the ray-system namespace?
d
g
Copy code
pod name is too long: len = 57, we will shorten it by offset = 7",2022-12-
That's a bug in KubeRay.
Which version of KubeRay are you using?
d
0.3.0
I see this issue has already been raised here. Is there a workaround for now?
g
Use FlyteRemote to launch the workflow and rename the execution.
But I remember I fixed this in KubeRay; could you try the master branch of KubeRay?
d
After changing to master, I am still getting the same error.
g
If you want to work around it, you could use FlyteRemote to trigger the workflow and rename the execution. I'll ask the KubeRay team if they can fix it in the next release.
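Something like this (a minimal sketch; the project, domain, and workflow names are placeholders):
Copy code
# Sketch of the workaround: launch via FlyteRemote and pick a short
# execution name so the generated Ray pod names stay under the length limit.
from flytekit.configuration import Config
from flytekit.remote import FlyteRemote

remote = FlyteRemote(
    config=Config.auto(),              # reads your local flytectl config
    default_project="flytesnacks",     # placeholder project
    default_domain="development",      # placeholder domain
)

# Fetch the already-registered workflow (name is a placeholder).
wf = remote.fetch_workflow(name="ray_example.ray_workflow")

# A short execution_name keeps the derived pod names short.
execution = remote.execute(wf, inputs={"n": 5}, execution_name="ray1")
print(execution.id.name)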
d
So I tried the remote execute and rename. There is a new error regarding ingress.
This was on the master branch rather than 0.3.0. I just tried it on v0.3.0 as well, and it is the same error.
g
KubeRay just shows an error, but doesn't include any error message… Mind sharing the code you're running? I'll dig into it later today.
d
Hi, have you had a chance to look at this?
w
@glamorous-carpet-83516
g
There is a new bug in KubeRay on the master branch. The issue is that the KubeRay operator doesn't update the status of the Ray cluster, so Flyte assumes the cluster isn't started and keeps waiting for it.
Just to confirm, have you seen this kind of error in the KubeRay operator logs?
Copy code
2022-12-05T22:18:30.460Z	INFO	controllers.RayCluster	reconcilePods	{"removing the pods in the scaleStrategy of": "test-group"}
2022-12-05T22:18:30.460Z	INFO	controllers.RayCluster	reconcilePods	{"all workers already exist for group": "test-group"}
2022-12-05T22:18:30.466Z	INFO	controllers.RayJob	UpdateState	{"oldJobStatus": "", "newJobStatus": "", "oldJobDeploymentStatus": "Initializing", "newJobDeploymentStatus": "FailedToGetJobStatus"}
2022-12-05T22:18:30.476Z	ERROR	controller.rayjob	Reconciler error	{"reconciler group": "ray.io", "reconciler kind": "RayJob", "name": "test-1-n0-0", "namespace": "flytesnacks-development", "error": "Get \"http://test-1-n0-0-raycluster-ccpp6-head-svc.flytesnacks-development.svc.cluster.local:8265/api/jobs/test-1-n0-0-c84r4\": dial tcp 10.96.80.64:8265: connect: connection refused"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:227
By the way, have you installed ray[default]? It includes the dashboard dependencies.
d
Oh, I just have ray installed.
I didn't know [default] was needed.
g
Sorry, that's my bad. I forgot to add it to setup.py. Could you install it and run the workflow again?
d
@glamorous-carpet-83516 I installed ray[default] and am getting the same errors as before. Also, I have not seen your error before; it doesn't surface on either master or v0.3.0 for me.
g
Could you help me check whether the Ray cluster is created and whether the job status is still "queued"? If the cluster isn't created, try increasing the quota in the flyte-admin-base-config ConfigMap:
Copy code
apiVersion: v1
data:
  cluster_resources.yaml: |
    cluster_resources:
      customData:
      - production:
        - projectQuotaCpu:
            value: "8"
        - projectQuotaMemory:
            value: 16Gi
      - staging:
        - projectQuotaCpu:
            value: "8"
        - projectQuotaMemory:
            value: 16Gi
      - development:
        - projectQuotaCpu:
            value: "8"
        - projectQuotaMemory:
            value: 16Gi
Copy code
RayJob associated rayCluster found
The logs indicate the cluster is created. I think for some reason the KubeRay operator failed to submit the job to the Ray cluster. Could you show me the status of the RayJob using
kubectl describe RayJob <name> -n flytesnacks-development
d
So it appears to be created, but it is waiting for the dashboard. But I have ray[default] installed in the image it pulled, which I confirmed by running ray.init() inside that image, and that did start a dashboard.
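Roughly the check I ran inside the image (a sketch; the exact attribute reported can differ across Ray versions):
Copy code
# Inside the task image: verify that the dashboard dependencies from
# ray[default] are actually present.
import ray

ctx = ray.init()
# With ray[default] installed, a local dashboard starts and its address is
# reported here (the attribute may differ on older Ray versions).
print(ctx.dashboard_url)
ray.shutdown()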
g
It can't get the dashboard URL for some reason. Are both the workers and the head node created?
d
There are no pods associated with the Ray cluster. I did see that it created a node after running the job, but looking at the logs on that node I can't see why the pods aren't being created.
When it creates pods for the Ray head and workers, what are the naming formats?
g
<Execution-name>-worker
<Execution-name>-head
d
Those pods aren't created.
g
@delightful-computer-49028 do you have 5 minutes to hop on a call?
d
When would you be available for it?
a
Was this problem solved?
d
Yes, this has been solved.
a
Thank you @delightful-computer-49028! Any chance you remember how it was fixed?
d
It was fixed by making sure ray[default] was installed and by installing the ingress component of the Helm install.