# flyte-support
d
Hello, I followed the Ray k8s guide and ran the example task. It has been queuing indefinitely for a day, and I can't find any logs telling me what the issue is. Does anyone know what the problem is, or how to track it down?
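For context, the task I ran is essentially the example from the guide; roughly like this (a sketch with illustrative names and resource values, not my exact code):
Copy code
import typing

import ray
from flytekit import Resources, task, workflow
from flytekitplugins.ray import HeadNodeConfig, RayJobConfig, WorkerNodeConfig


@ray.remote
def square(x: int) -> int:
    # Simple remote function so the Ray cluster has something to run.
    return x * x


ray_config = RayJobConfig(
    head_node_config=HeadNodeConfig(ray_start_params={"dashboard-host": "0.0.0.0"}),
    worker_node_config=[WorkerNodeConfig(group_name="ray-group", replicas=2)],
)


@task(task_config=ray_config, requests=Resources(cpu="2", mem="4Gi"))
def ray_task(n: int) -> typing.List[int]:
    # Flyte brings up the Ray cluster for this task and runs the job on it.
    futures = [square.remote(i) for i in range(n)]
    return ray.get(futures)


@workflow
def ray_workflow(n: int) -> typing.List[int]:
    return ray_task(n=n)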
g
It may be because you don't have enough resources in your cluster. Could you run
kubectl get pods -n flytesnacks-development
to see if the Ray worker pods are stuck in Pending?
If so, you may need to increase your cluster CPU/memory, or reduce the worker CPUs.
d
I definitely have enough resources available. I lowered the CPU and memory for the task and created larger node instances just to make sure.
g
Could you share the KubeRay operator logs, and the output of
kubectl describe raycluster <ray_cluster_name> -n flytesnacks-development
d
I don't have a Ray cluster. I thought Flyte automatically creates the Ray cluster as needed?
g
Yes, Flyte creates a Ray cluster in k8s. It's a custom resource, so you can also get logs from it. You should see some pods created for the Ray cluster:
Copy code
-> kubectl get pods -n flytesnacks-development
ray-head-...
ray-worker...
d
There are no pods. I do see the CRD from kubectl, though.
g
Could you share the logs of the KubeRay operator in the ray-system namespace?
d
g
Copy code
pod name is too long: len = 57, we will shorten it by offset = 7",2022-12-
That's a bug in KubeRay.
Which version of KubeRay are you using?
d
0.3.0
I see this issue has already been raised here. Is there a workaround for now?
g
Use FlyteRemote to launch the workflow and rename the execution.
But I remember I fixed this in KubeRay; could you try the master branch of KubeRay?
d
After changing to master, I am still getting the same error.
g
If you want to work around it, you could use FlyteRemote to trigger the workflow and rename the execution. I'll ask the KubeRay team if they can fix it in the next release.
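Something like this (a minimal sketch; the project, domain, and workflow names are placeholders):
Copy code
# Sketch of the workaround: launch via FlyteRemote and pick a short
# execution name so the generated Ray pod names stay under the length limit.
from flytekit.configuration import Config
from flytekit.remote import FlyteRemote

remote = FlyteRemote(
    config=Config.auto(),              # reads your local flytectl config
    default_project="flytesnacks",     # placeholder project
    default_domain="development",      # placeholder domain
)

# Fetch the already-registered workflow (name is a placeholder).
wf = remote.fetch_workflow(name="ray_example.ray_workflow")

# A short execution_name keeps the derived pod names short.
execution = remote.execute(wf, inputs={"n": 5}, execution_name="ray1")
print(execution.id.name)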
d
So I tried the remote execute and rename. There is a new error regarding ingress.
This was on the master branch rather than 0.3.0. I just tried it on v0.3.0 as well, and it is the same error.
g
KubeRay just shows an error, but doesn't include any error message… Mind sharing the code you're running? I'll dig into it later today.
d
Hi, have you had a chance to look at this?
w
@glamorous-carpet-83516
g
There is a new bug in KubeRay on the master branch. The issue is that the KubeRay operator doesn't update the status of the Ray cluster, so Flyte assumes the cluster isn't started and keeps waiting for it.
Just to confirm, have you seen this kind of error in the KubeRay operator logs?
Copy code
2022-12-05T22:18:30.460Z	INFO	controllers.RayCluster	reconcilePods	{"removing the pods in the scaleStrategy of": "test-group"}
2022-12-05T22:18:30.460Z	INFO	controllers.RayCluster	reconcilePods	{"all workers already exist for group": "test-group"}
2022-12-05T22:18:30.466Z	INFO	controllers.RayJob	UpdateState	{"oldJobStatus": "", "newJobStatus": "", "oldJobDeploymentStatus": "Initializing", "newJobDeploymentStatus": "FailedToGetJobStatus"}
2022-12-05T22:18:30.476Z	ERROR	controller.rayjob	Reconciler error	{"reconciler group": "ray.io", "reconciler kind": "RayJob", "name": "test-1-n0-0", "namespace": "flytesnacks-development", "error": "Get \"http://test-1-n0-0-raycluster-ccpp6-head-svc.flytesnacks-development.svc.cluster.local:8265/api/jobs/test-1-n0-0-c84r4\": dial tcp 10.96.80.64:8265: connect: connection refused"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:227
By the way, have you installed ray[default]? It includes the dashboard dependencies.
d
Oh, I just have ray installed.
I didn't know [default] was needed.
g
Sorry, that's my bad. I forgot to add it to setup.py. Could you install it and run the workflow again?
d
@glamorous-carpet-83516 I installed ray[default] and am getting the same errors as before. Also, I have not seen your error before; it doesn't surface on either master or v0.3.0 for me.
g
Could you help me check whether the Ray cluster is created and whether the job status is still "queued"? If the cluster isn't created, try increasing the quota in the flyte-admin-base-config ConfigMap:
Copy code
apiVersion: v1
data:
  cluster_resources.yaml: |
    cluster_resources:
      customData:
      - production:
        - projectQuotaCpu:
            value: "8"
        - projectQuotaMemory:
            value: 16Gi
      - staging:
        - projectQuotaCpu:
            value: "8"
        - projectQuotaMemory:
            value: 16Gi
      - development:
        - projectQuotaCpu:
            value: "8"
        - projectQuotaMemory:
            value: 16Gi
Copy code
RayJob associated rayCluster found
The logs indicate the cluster is created. I think for some reason the KubeRay operator failed to submit the job to the Ray cluster. Could you show me the status of the RayJob using
kubectl describe RayJob <name> -n flytesnacks-development
d
So it appears to be created, but it is waiting for the dashboard. But I have ray[default] installed in the image it pulled, which I confirmed by running ray.init() inside that image, and that did start a dashboard.
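Roughly the check I ran inside the image (a sketch; the exact attribute reported can differ across Ray versions):
Copy code
# Inside the task image: verify that the dashboard dependencies from
# ray[default] are actually present.
import ray

ctx = ray.init()
# With ray[default] installed, a local dashboard starts and its address is
# reported here (the attribute may differ on older Ray versions).
print(ctx.dashboard_url)
ray.shutdown()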
g
It can't get the dashboard URL for some reason. Are both the workers and the head node created?
d
There are no pods associated with the Ray cluster. I did see that it created a node after running the job, but looking at the logs on that node I can't see why the pods aren't being created.
When it creates pods for the Ray head and workers, what are the naming formats?
g
<Execution-name>-worker
<Execution-name>-head
d
Those pods aren't created.
g
@delightful-computer-49028 do you have 5 minutes to hop on a call?
d
When would you be available for it?
a
Was this problem solved?
d
Yes, this has been solved.
a
Thank you @delightful-computer-49028! Any chance you remember how it was fixed?
d
It was fixed by making sure ray[default] was installed and by installing the ingress component of the Helm install.