# flyte-deployment
p
Hi, I'm having some issues with submitting my first RayJob with Flyte 🧡
I'm using flyte-binary v1.14.1 and kuberay-operator v1.1.0. I'm using the example from the docs, and when running it on the cluster I find that my RayJob hangs during initialization:
NAME                                 JOB STATUS   DEPLOYMENT STATUS   RAY CLUSTER NAME                                      START TIME             END TIME   AGE
awl7lx78t947ldcrc565-testraytask-0                Initializing        awl7lx78t947ldcrc565-testraytask-0-raycluster-dkwps   2025-02-10T22:46:20Z              4m54s
Further investigation:
NAME                                                  DESIRED WORKERS   AVAILABLE WORKERS   CPUS   MEMORY   GPUS   STATUS   AGE
awl7lx78t947ldcrc565-testraytask-0-raycluster-dkwps                                         0      0        0      failed   5m12s
Describing the RayCluster shows no obvious events
Status:
  Desired CPU:     0
  Desired GPU:     0
  Desired Memory:  0
  Desired TPU:     0
  Head:
  State:  failed
Events:
  Type    Reason   Age    From                   Message
  ----    ------   ----   ----                   -------
  Normal  Created  6m31s  raycluster-controller  Created service account wl7lx78t947ldcrc565-testraytask-0-raycluster-dkwps
  Normal  Created  6m31s  raycluster-controller  Created role wl7lx78t947ldcrc565-testraytask-0-raycluster-dkwps
  Normal  Created  6m31s  raycluster-controller  Created role binding wl7lx78t947ldcrc565-testraytask-0-raycluster-dkwps
  Normal  Created  6m31s  raycluster-controller  Created ingress crc565-testraytask-0-raycluster-dkwps-head-ingress
Checking the logs of the kuberay operator:
{"level":"error","ts":"2025-02-10T22:51:48.356Z","logger":"controllers.RayCluster","msg":"Pod Service create error!","RayCluster":{"name":"awl7lx78t947ldcrc565-testraytask-0-raycluster-dkwps","namespace":"fl97"},"reconcileID":"dacf02b6-21ae-4118-a80d-d73e15c70c7c","Pod.Service.Error":"Service \"r7ldcrc565-testraytask-0-raycluster-dkwps-head-svc\" is invalid: [spec.ports[3].nodePort: Duplicate value: 31517, spec.ports[3]: Duplicate value: core.ServicePort{Name:\"\", Protocol:\"TCP\", AppProtocol:(*string)(nil), Port:8080, TargetPort:intstr.IntOrString{Type:0, IntVal:0, StrVal:\"\"}, NodePort:0}]","error":"Service \"r7ldcrc565-testraytask-0-raycluster-dkwps-head-svc\" is invalid: [spec.ports[3].nodePort: Duplicate value: 31517, spec.ports[3]: Duplicate value: core.ServicePort{Name:\"\", Protocol:\"TCP\", AppProtocol:(*string)(nil), Port:8080, TargetPort:intstr.IntOrString{Type:0, IntVal:0, StrVal:\"\"}, NodePort:0}]","stacktrace":"github.com/ray-project/kuberay/ray-operator/controllers/ray.(*RayClusterReconciler).createService\n\t/home/runner/work/kuberay/kuberay/ray-operator/controllers/ray/raycluster_controller.go:1002\ngithub.com/ray-project/kuberay/ray-operator/controllers/ray.(*RayClusterReconciler).reconcileHeadService\n\t/home/runner/work/kuberay/kuberay/ray-operator/controllers/ray/raycluster_controller.go:549\ngithub.com/ray-project/kuberay/ray-operator/controllers/ray.(*RayClusterReconciler).rayClusterReconcile\n\t/home/runner/work/kuberay/kuberay/ray-operator/controllers/ray/raycluster_controller.go:330\ngithub.com/ray-project/kuberay/ray-operator/controllers/ray.(*RayClusterReconciler).Reconcile\n\t/home/runner/work/kuberay/kuberay/ray-operator/controllers/ray/raycluster_controller.go:169\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:119\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:316\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:227"}
{"level":"info","ts":"2025-02-10T22:51:48.356Z","logger":"controllers.RayCluster","msg":"Warning: Reconciler returned both a non-zero result and a non-nil error. The result will always be ignored if the error is non-nil and the non-nil error causes reqeueuing with exponential backoff. For more details, see: <https://pkg.go.dev/sigs.k8s.io/controller-runtime/pkg/reconcile#Reconciler>","RayCluster":{"name":"awl7lx78t947ldcrc565-testraytask-0-raycluster-dkwps","namespace":"fl97"},"reconcileID":"dacf02b6-21ae-4118-a80d-d73e15c70c7c"}
{"level":"error","ts":"2025-02-10T22:51:48.356Z","logger":"controllers.RayCluster","msg":"Reconciler error","RayCluster":{"name":"awl7lx78t947ldcrc565-testraytask-0-raycluster-dkwps","namespace":"fl97"},"reconcileID":"dacf02b6-21ae-4118-a80d-d73e15c70c7c","error":"Service \"r7ldcrc565-testraytask-0-raycluster-dkwps-head-svc\" is invalid: [spec.ports[3].nodePort: Duplicate value: 31517, spec.ports[3]: Duplicate value: core.ServicePort{Name:\"\", Protocol:\"TCP\", AppProtocol:(*string)(nil), Port:8080, TargetPort:intstr.IntOrString{Type:0, IntVal:0, StrVal:\"\"}, NodePort:0}]","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:329\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:227"}
Extracting the relevant text:
"Pod.Service.Error":"Service \"r7ldcrc565-testraytask-0-raycluster-dkwps-head-svc\" is invalid: [spec.ports[3].nodePort: Duplicate value: 31517, spec.ports[3]: Duplicate value: core.ServicePort{Name:\"\", Protocol:\"TCP\", AppProtocol:(*string)(nil), Port:8080, TargetPort:intstr.IntOrString{Type:0, IntVal:0, StrVal:\"\"}, NodePort:0}]","error":"Service \"r7ldcrc565-testraytask-0-raycluster-dkwps-head-svc\" is invalid: [spec.ports[3].nodePort: Duplicate value: 31517, spec.ports[3]: Duplicate value: core.ServicePort{Name:\"\", Protocol:\"TCP\", AppProtocol:(*string)(nil), Port:8080, TargetPort:intstr.IntOrString{Type:0, IntVal:0, StrVal:\"\"}, NodePort:0}]"
It looks like there's an issue with service creation? Is there another version of kuberay that I should be deploying that doesn't have this issue? I'm in an on-prem k8s cluster (RKE2)
I've been searching around for a solution but I can't seem to find one.
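For anyone else digging through the operator logs: since each line is a JSON object, the relevant field can be pulled out with a few lines of Python instead of by eye. This is only a minimal sketch. The function name `extract_service_errors` is hypothetical, and the sample lines are abbreviated stand-ins I made up to keep it short, not real operator output.

```python
import json


def extract_service_errors(log_lines):
    """Yield (RayCluster name, error text) for error-level JSON log entries."""
    for line in log_lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip anything that is not a JSON log line
        if entry.get("level") != "error":
            continue
        # Prefer the service-specific field, fall back to the generic one.
        message = entry.get("Pod.Service.Error") or entry.get("error")
        if message:
            cluster = entry.get("RayCluster", {}).get("name", "<unknown>")
            yield cluster, message


# Abbreviated, made-up sample lines shaped like the operator logs above.
sample = [
    '{"level":"info","msg":"Reconciler returned both a non-zero result and a non-nil error."}',
    '{"level":"error","msg":"Pod Service create error!",'
    '"RayCluster":{"name":"example-raycluster","namespace":"fl97"},'
    '"Pod.Service.Error":"spec.ports[3]: Duplicate value"}',
    "not a json line",
]

results = list(extract_service_errors(sample))
for cluster, message in results:
    print(f"{cluster}: {message}")
```

In practice the lines would come from the operator pod's log output piped into the script rather than from a hard-coded list.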
g
s invalid: [spec.ports[3].nodePort: Duplicate value: 31517, spec.ports[3]: Duplicate value: core.ServicePort{Name:\"\", Protocol:\"TCP\", AppProtocol:(*string)(nil), Port:8080, TargetPort:intstr.IntOrString{Type:0, IntVal:0, StrVal:\"\"}, NodePort:0}]","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.
are you using nodePort for the RayCluster?
p
Not intentionally, I'm letting Flyte create it
Some additional findings. I couldn't find a way to modify the kuberay-operator's Helm values, so I made a mutating webhook configuration that modifies the RayCluster serviceType from NodePort to ClusterIP. When I do this I still get the same FailedToCreateService error:
Failed creating service fl97/brzkjkfl9g-testraytask-0-raycluster-snpj2-head-svc, Service "brzkjkfl9g-testraytask-0-raycluster-snpj2-head-svc" is invalid: spec.ports[3]: Duplicate value: core.ServicePort{Name:"", Protocol:"TCP", AppProtocol:(*string)(nil), Port:8080, TargetPort:intstr.IntOrString{Type:0, IntVal:0, StrVal:""}, NodePort:0}
Instead of several of these services failing, it's just one. Does anyone know how to configure the kuberay-operator to stop creating NodePorts?
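A note on reading that error: the duplicate key it prints is a ServicePort with only Protocol and Port filled in, which suggests the check is on the (protocol, port) pair and that the port name does not matter. Below is a minimal sketch of that kind of check, handy for inspecting a port list before it reaches the API server. The helper name `find_duplicate_ports` is hypothetical, and the port list is an assumption for illustration: the `http` entry is the one from this thread, while the other names and numbers are guesses at what the operator might add, not taken from the logs.

```python
from collections import defaultdict


def find_duplicate_ports(ports):
    """Return {(protocol, port): [indices]} for pairs that occur more than once."""
    seen = defaultdict(list)
    for index, port in enumerate(ports):
        # Protocol defaults to TCP when it is not set on a Service port.
        key = (port.get("protocol", "TCP"), port["port"])
        seen[key].append(index)
    return {key: indices for key, indices in seen.items() if len(indices) > 1}


# Assumed port list for illustration only.
ports = [
    {"name": "http", "port": 8080, "protocol": "TCP"},  # extra port on the head container
    {"name": "gcs-server", "port": 6379},                # assumed operator default
    {"name": "dashboard", "port": 8265},                 # assumed operator default
    {"name": "metrics", "port": 8080},                   # assumed operator default
]

duplicates = find_duplicate_ports(ports)
print(duplicates)
```

Two entries with different names but the same protocol and port number are reported as duplicates, which matches the behaviour seen here even after switching the service type to ClusterIP.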
Another Update, I got it working! Sort of! The issue was that the RayCluster head node was being generated with this port:
- containerPort: 8080
  name: http
  protocol: TCP
Modifying my webhook to remove that port fixed the issue. Not sure why Flyte does this. Additionally, the example given breaks immediately because the default flyte pod doesn't have ray! So now I have to fix the example and figure out the containers for everything
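For reference, here is roughly what the webhook change amounts to, as a minimal sketch rather than my actual webhook code. It builds JSON Patch operations that drop the `http`/8080 container port from the head group and wraps them in an AdmissionReview response. The function names, the example object, and the `example-uid` value are hypothetical, and the `spec.headGroupSpec.template.spec.containers` path is my assumption about where the head container lives in the RayCluster object.

```python
import base64
import json

HEAD_CONTAINERS_PATH = "/spec/headGroupSpec/template/spec/containers"


def build_remove_port_patch(raycluster, port_number=8080, port_name="http"):
    """Build JSON Patch ops removing matching container ports from the head group."""
    ops = []
    containers = (
        raycluster.get("spec", {})
        .get("headGroupSpec", {})
        .get("template", {})
        .get("spec", {})
        .get("containers", [])
    )
    for c_idx, container in enumerate(containers):
        ports = container.get("ports", [])
        # Walk backwards so an earlier removal does not shift later indices.
        for p_idx in range(len(ports) - 1, -1, -1):
            port = ports[p_idx]
            if port.get("containerPort") == port_number and port.get("name") == port_name:
                ops.append(
                    {"op": "remove", "path": f"{HEAD_CONTAINERS_PATH}/{c_idx}/ports/{p_idx}"}
                )
    return ops


def admission_response(uid, ops):
    """Wrap patch ops in an admission.k8s.io/v1 AdmissionReview response."""
    response = {"uid": uid, "allowed": True}
    if ops:
        response["patchType"] = "JSONPatch"
        response["patch"] = base64.b64encode(json.dumps(ops).encode()).decode()
    return {"apiVersion": "admission.k8s.io/v1", "kind": "AdmissionReview", "response": response}


# Hypothetical RayCluster fragment carrying the extra port from this thread.
cluster = {
    "spec": {
        "headGroupSpec": {
            "template": {
                "spec": {
                    "containers": [
                        {
                            "name": "ray-head",
                            "ports": [
                                {"containerPort": 8080, "name": "http", "protocol": "TCP"},
                                {"containerPort": 8265, "name": "dashboard", "protocol": "TCP"},
                            ],
                        }
                    ]
                }
            }
        }
    }
}

ops = build_remove_port_patch(cluster)
review = admission_response("example-uid", ops)
print(json.dumps(ops, indent=2))
```

Matching on both the port number and the name keeps the patch from touching other ports that happen to use 8080 under a different name.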
The issue is all my fault! I have a template that automatically creates that port, and so I've shot myself in the foot.