# ask-the-community

Tarmily Wen

11/30/2022, 9:32 PM
Hello, I followed the Ray k8s guide and ran the example task. It has been queuing indefinitely for a day. I can't find any logs telling me what the issue is. Does anyone know what the issue is or how to search for the problem?

Kevin Su

11/30/2022, 9:56 PM
It may be because you don't have enough resources in your cluster. Could you run
```
kubectl get pods -n flytesnacks-development
```
to see if the Ray worker pods are stuck in Pending? If so, you may need to increase your cluster CPU/memory, or reduce the worker CPUs.
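If any worker pods are Pending, describing one usually shows the scheduling reason in its Events section. A minimal check, assuming the default flytesnacks-development namespace:
```
kubectl describe pod <ray-worker-pod-name> -n flytesnacks-development
```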

Tarmily Wen

12/01/2022, 5:44 PM
I definitely have enough resources available. I lowered the cpu and mem for the task and created larger node instances just to make sure

Kevin Su

12/01/2022, 5:56 PM
Could you share the log of the kuberay operator? And the output of:
```
kubectl describe raycluster <name>
```

Tarmily Wen

12/01/2022, 5:59 PM
I don't have a ray cluster. I thought flyte automatically creates the ray cluster as needed?

Kevin Su

12/01/2022, 6:03 PM
Yes, Flyte creates a Ray cluster in k8s. It's a custom resource, so you can also get the logs from it. You should see some pods created for the Ray cluster.
```
-> kubectl get pods -n flytesnacks-development
ray-head-...
ray-worker...
```
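If those pods never appear, it can also help to inspect the RayCluster resource itself. A sketch, assuming the kuberay CRDs are installed and the default flytesnacks-development namespace:
```
kubectl get rayclusters -n flytesnacks-development
kubectl describe raycluster <cluster-name> -n flytesnacks-development
```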

Tarmily Wen

12/01/2022, 6:06 PM
There are no pods. And I do see the crd from kubectl

Kevin Su

12/01/2022, 6:14 PM
Could you share the log of the kuberay operator in the ray-system namespace?
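Something like this should pull it, assuming the operator runs as a deployment named kuberay-operator in ray-system:
```
kubectl logs deployment/kuberay-operator -n ray-system
```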

Tarmily Wen

12/01/2022, 6:18 PM

Kevin Su

12/01/2022, 6:41 PM
```
pod name is too long: len = 57, we will shorten it by offset = 7",2022-12-
```
That's a bug in kuberay. Which version of kuberay are you using?

Tarmily Wen

12/01/2022, 6:50 PM
0.3.0
I see this issue has already been raised here. Then is there a workaround for now?

Kevin Su

12/01/2022, 6:58 PM
Use FlyteRemote to launch the workflow, and rename the execution.
But I remember I fixed it in kuberay; could you try the master branch of kuberay?
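A rough sketch of that workaround with flytekit's FlyteRemote. The project, domain, workflow name, version, and inputs below are placeholders; the important part is passing a short execution_name so the generated Ray pod names stay under the length limit:
```
from flytekit.configuration import Config
from flytekit.remote import FlyteRemote

# Connect to the Flyte backend using the local flytekit/flytectl config.
remote = FlyteRemote(
    config=Config.auto(),
    default_project="flytesnacks",
    default_domain="development",
)

# Fetch the registered Ray example workflow (name/version are placeholders).
wf = remote.fetch_workflow(name="ray_example.ray_workflow", version="v1")

# Launch it with an explicit, short execution name.
execution = remote.execute(wf, inputs={"n": 5}, execution_name="ray-test-1")
print(execution.id.name)
```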

Tarmily Wen

12/01/2022, 7:21 PM
after changing to master, I am still getting the same error

Kevin Su

12/01/2022, 8:58 PM
If you want to work around it, you could use flytekit's FlyteRemote to trigger the workflow and rename the execution. I'll ask the kuberay team if they can fix it in the next release.

Tarmily Wen

12/01/2022, 10:13 PM
So I tried the remote execute and rename. There is a new error regarding ingress
This is still on the master branch instead of 0.3.0. Just tried it on v0.3.0 as well and it is the same error

Kevin Su

12/01/2022, 10:35 PM
kuberay just shows an error, but doesn't have any error message… Mind sharing the code you're running? I'll dig into it later today.

Tarmily Wen

12/01/2022, 10:43 PM
Hi, have you had a chance to look at this?

Tyler Su

12/05/2022, 5:16 PM
@Kevin Su

Kevin Su

12/05/2022, 10:11 PM
There is a new bug in kuberay on the master branch. The issue is that the kuberay operator doesn't update the status of the Ray cluster, so Flyte assumes the cluster isn't started and keeps waiting for it.
Just to confirm, have you seen this kind of error in the kuberay operator?
```
2022-12-05T22:18:30.460Z	INFO	controllers.RayCluster	reconcilePods	{"removing the pods in the scaleStrategy of": "test-group"}
2022-12-05T22:18:30.460Z	INFO	controllers.RayCluster	reconcilePods	{"all workers already exist for group": "test-group"}
2022-12-05T22:18:30.466Z	INFO	controllers.RayJob	UpdateState	{"oldJobStatus": "", "newJobStatus": "", "oldJobDeploymentStatus": "Initializing", "newJobDeploymentStatus": "FailedToGetJobStatus"}
2022-12-05T22:18:30.476Z	ERROR	controller.rayjob	Reconciler error	{"reconciler group": "ray.io", "reconciler kind": "RayJob", "name": "test-1-n0-0", "namespace": "flytesnacks-development", "error": "Get \"http://test-1-n0-0-raycluster-ccpp6-head-svc.flytesnacks-development.svc.cluster.local:8265/api/jobs/test-1-n0-0-c84r4\": dial tcp 10.96.80.64:8265: connect: connection refused"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:227
```
btw, have you installed ray[default]? It includes the dashboard dependencies.
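For reference, that extra can be added to the task image with something like:
```
pip install "ray[default]"
```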

Tarmily Wen

12/06/2022, 1:55 AM
oh i just have ray installed.
didn't know [default] was needed

Kevin Su

12/06/2022, 4:17 AM
Sorry, that's my bad. I forgot to add it to setup.py. Could you install it and run the workflow again?

Tarmily Wen

12/06/2022, 4:44 PM
@Kevin Su I installed ray[default]. I am getting the same errors as before. Also I have not seen your error before. It doesn't surface on either master or v0.3.0 for me

Kevin Su

12/06/2022, 7:50 PM
Could you help me check if the Ray cluster is created and if the job status is still "queued"? If the cluster isn't created, try to increase the quota in the configmap flyte-admin-base-config:
```
apiVersion: v1
data:
  cluster_resources.yaml: |
    cluster_resources:
      customData:
      - production:
        - projectQuotaCpu:
            value: "8"
        - projectQuotaMemory:
            value: 16Gi
      - staging:
        - projectQuotaCpu:
            value: "8"
        - projectQuotaMemory:
            value: 16Gi
      - development:
        - projectQuotaCpu:
            value: "8"
        - projectQuotaMemory:
            value: 16Gi
```
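A sketch of how to apply that change; the flyte namespace and the flyteadmin deployment name are assumptions about a default flyte-core install:
```
kubectl edit configmap flyte-admin-base-config -n flyte
# restart flyteadmin so it picks up the new quota values
kubectl rollout restart deployment/flyteadmin -n flyte
```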
```
RayJob associated rayCluster found
```
The logs indicate the cluster is created. I think for some reason the kuberay operator failed to submit the job to the Ray cluster. Could you show me the status of the RayJob by using
```
kubectl describe RayJob <name> -n flytesnacks-development
```

Tarmily Wen

12/06/2022, 8:33 PM
So it appears to be created, but it's waiting for the dashboard. But I have ray[default] installed in the image it pulled, which I confirmed by running ray.init() inside that image, which created a dashboard.
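A minimal version of that check, run inside the container; include_dashboard just forces the dashboard to start, and with ray[default] installed the init logs print its URL (port 8265 by default):
```
import ray

# Start a local Ray instance; ray[default] is required for the dashboard.
ray.init(include_dashboard=True)
```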

Kevin Su

12/06/2022, 8:50 PM
It can't get the dashboard URL for some reason. Are both the workers and the head node created?

Tarmily Wen

12/06/2022, 9:20 PM
There are no pods associated with the Ray cluster. I did see that it created a node after running the job, and looking at the logs on the node I don't see why the pods aren't being created.
When it creates pods for the Ray head and workers, what are the naming formats?

Kevin Su

12/06/2022, 9:29 PM
<Execution-name>-worker <Execution-name>-head
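So something along these lines should list them if they exist (the execution name is a placeholder):
```
kubectl get pods -n flytesnacks-development | grep <execution-name>
```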

Tarmily Wen

12/06/2022, 9:29 PM
those pods aren't created

Kevin Su

12/06/2022, 9:32 PM
@Tarmily Wen do you have 5mins to hop on a call?

Tarmily Wen

12/06/2022, 10:29 PM
When would you be available for it?

David Espejo (he/him)

01/10/2023, 9:41 PM
Was this problem solved?

Tarmily Wen

01/10/2023, 9:41 PM
yes this has been solved

David Espejo (he/him)

01/10/2023, 9:43 PM
Thank you @Tarmily Wen! Any chance you remember how it was fixed?

Tarmily Wen

01/10/2023, 9:44 PM
It was fixed by making sure ray[default] was installed and by installing the ingress component of the Helm install.