Tarmily Wen
11/30/2022, 9:32 PM
Kevin Su
11/30/2022, 9:56 PM
kubectl get pods -n flytesnacks-development
to see if the status of ray worker pods is pending?
Tarmily Wen
12/01/2022, 5:44 PM
Kevin Su
12/01/2022, 5:56 PM
kubectl describe raycluster <ray_cluster>
Tarmily Wen
12/01/2022, 5:59 PM
Kevin Su
12/01/2022, 6:03 PM
-> kubectl get pods -n flytesnacks-development
ray-head-...
ray-worker...
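For anyone who wants to script the "is any worker pod pending?" check, here is a minimal sketch. It assumes the default tabular output of `kubectl get pods` (with `NAME` and `STATUS` columns); the function name `pending_pods` and the sample pod names are hypothetical and only there for illustration.

```python
from typing import List


def pending_pods(kubectl_output: str) -> List[str]:
    """Return the names of pods whose STATUS column reads 'Pending'.

    Expects the plain-text table printed by `kubectl get pods`.
    """
    lines = [ln for ln in kubectl_output.strip().splitlines() if ln.strip()]
    if not lines:
        return []
    header = lines[0].split()
    name_idx = header.index("NAME")
    status_idx = header.index("STATUS")
    pending = []
    for line in lines[1:]:
        cols = line.split()
        # Skip malformed rows that are shorter than the header expects.
        if len(cols) > max(name_idx, status_idx) and cols[status_idx] == "Pending":
            pending.append(cols[name_idx])
    return pending


# Hypothetical sample output; these pod names are made up.
SAMPLE = """\
NAME                 READY   STATUS    RESTARTS   AGE
example-head-abc     1/1     Running   0          5m
example-worker-xyz   0/1     Pending   0          5m
"""

if __name__ == "__main__":
    print(pending_pods(SAMPLE))
```

In practice the text would come from running the `kubectl get pods` command above and passing its stdout to the function.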
Tarmily Wen
12/01/2022, 6:06 PM
Kevin Su
12/01/2022, 6:14 PM
Tarmily Wen
12/01/2022, 6:18 PM
Kevin Su
12/01/2022, 6:41 PM
pod name is too long: len = 57, we will shorten it by offset = 7",2022-12-
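The arithmetic in that log line (57 - 7 = 50) suggests a 50-character cap on pod names. A rough sketch of that shortening follows. Two things here are assumptions, not confirmed by the thread: the 50-character limit is inferred only from the numbers in the log line, and the choice to drop characters from the front of the name is a guess. The helper name `shorten_pod_name` is hypothetical.

```python
MAX_POD_NAME_LEN = 50  # assumption: inferred from len = 57 and offset = 7


def shorten_pod_name(name: str, max_len: int = MAX_POD_NAME_LEN) -> str:
    """Trim a name to at most max_len characters.

    Assumption: the excess ("offset") is removed from the start of the
    name; the operator's real behaviour may differ.
    """
    offset = max(len(name) - max_len, 0)
    return name[offset:]


if __name__ == "__main__":
    long_name = "x" * 7 + "y" * 50  # 57 characters, as in the log line
    print(len(long_name), len(shorten_pod_name(long_name)))
```

If the shortened name matters for finding the pod later, it may be simpler to keep the task and workflow names short enough that no trimming happens.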
Tarmily Wen
12/01/2022, 6:50 PM
Kevin Su
12/01/2022, 6:58 PM
Tarmily Wen
12/01/2022, 7:21 PM
Kevin Su
12/01/2022, 8:58 PM
Tarmily Wen
12/01/2022, 10:13 PM
Kevin Su
12/01/2022, 10:35 PM
Tarmily Wen
12/01/2022, 10:43 PM
Tyler Su
12/05/2022, 5:16 PM
Kevin Su
12/05/2022, 10:11 PM
2022-12-05T22:18:30.460Z INFO controllers.RayCluster reconcilePods {"removing the pods in the scaleStrategy of": "test-group"}
2022-12-05T22:18:30.460Z INFO controllers.RayCluster reconcilePods {"all workers already exist for group": "test-group"}
2022-12-05T22:18:30.466Z INFO controllers.RayJob UpdateState {"oldJobStatus": "", "newJobStatus": "", "oldJobDeploymentStatus": "Initializing", "newJobDeploymentStatus": "FailedToGetJobStatus"}
2022-12-05T22:18:30.476Z ERROR controller.rayjob Reconciler error {"reconciler group": "ray.io", "reconciler kind": "RayJob", "name": "test-1-n0-0", "namespace": "flytesnacks-development", "error": "Get \"http://test-1-n0-0-raycluster-ccpp6-head-svc.flytesnacks-development.svc.cluster.local:8265/api/jobs/test-1-n0-0-c84r4\": dial tcp 10.96.80.64:8265: connect: connection refused"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:227
ray[default]? It includes the dashboard dependencies.
Tarmily Wen
12/06/2022, 1:55 AM
Kevin Su
12/06/2022, 4:17 AM
Tarmily Wen
12/06/2022, 4:44 PM
Kevin Su
12/06/2022, 7:50 PM
flyte-admin-base-config.
apiVersion: v1
data:
  cluster_resources.yaml: |
    cluster_resources:
      customData:
      - production:
        - projectQuotaCpu:
            value: "8"
        - projectQuotaMemory:
            value: 16Gi
      - staging:
        - projectQuotaCpu:
            value: "8"
        - projectQuotaMemory:
            value: 16Gi
      - development:
        - projectQuotaCpu:
            value: "8"
        - projectQuotaMemory:
            value: 16Gi
The logs indicate the cluster is created ("RayJob associated rayCluster found"). I think that, for some reason, the KubeRay operator failed to submit the job to the Ray cluster. Could you show me the status of the RayJob by using
kubectl describe RayJob <name> -n flytesnacks-development
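Since the operator log earlier in the thread shows "connection refused" on port 8265 of the head service, one quick way to tell "nothing is listening" apart from "the job itself failed" is a plain TCP probe. The sketch below uses only the Python standard library. The function name `port_open` is hypothetical, and the host you pass depends on where you run it: inside the cluster it would be the head service name from the log, and from a workstation it would be `127.0.0.1` after something like `kubectl port-forward`.

```python
import socket


def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds.

    A refused connection, a timeout, or a DNS failure all return False.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # 8265 is the port shown in the operator's error message.
    print(port_open("127.0.0.1", 8265))
```

A `False` result here would be consistent with the dashboard process not running on the head pod, which is what the `ray[default]` question above is probing for; it does not prove that cause on its own.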
Tarmily Wen
12/06/2022, 8:33 PM
Kevin Su
12/06/2022, 8:50 PM
Tarmily Wen
12/06/2022, 9:20 PM
Kevin Su
12/06/2022, 9:29 PM
Tarmily Wen
12/06/2022, 9:29 PM
Kevin Su
12/06/2022, 9:32 PM
Tarmily Wen
12/06/2022, 10:29 PM
David Espejo (he/him)
01/10/2023, 9:41 PM
Tarmily Wen
01/10/2023, 9:41 PM
David Espejo (he/him)
01/10/2023, 9:43 PM
Tarmily Wen
01/10/2023, 9:44 PM