
Padma Priya M

12/08/2022, 3:11 PM
Hi, when I start a Ray cluster, the task runs on only one instance and in a single pod. Generally, when a Ray cluster is started, it is expected to run across multiple instances in a distributed manner, right? Can we scale horizontally here to increase the pool of resources?

Samhita Alla

12/09/2022, 4:33 AM
cc: @Kevin Su

Kevin Su

12/09/2022, 5:35 AM
Hmm, when a Ray task starts, propeller should create a head node and worker nodes. Did you enable the Ray plugin in propeller?
tasks:
  task-plugins:
    enabled-plugins:
      - container
      - sidecar
      - k8s-array
      - ray
    default-for-task-types:
      container: container
      sidecar: sidecar
      container_array: k8s-array
      ray: ray
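One way to sanity-check a setting like the one above is to load the propeller config into a dictionary and confirm that `ray` appears in `enabled-plugins` and is mapped under `default-for-task-types`. The sketch below assumes the YAML has already been parsed into a Python dict; the helper name `ray_plugin_enabled` is hypothetical and not part of Flyte.

```python
from typing import Any, Mapping

# Assumption: this dict mirrors the YAML snippet above after parsing.
PROPELLER_CONFIG = {
    "tasks": {
        "task-plugins": {
            "enabled-plugins": ["container", "sidecar", "k8s-array", "ray"],
            "default-for-task-types": {
                "container": "container",
                "sidecar": "sidecar",
                "container_array": "k8s-array",
                "ray": "ray",
            },
        }
    }
}


def ray_plugin_enabled(config: Mapping[str, Any]) -> bool:
    """Hypothetical helper: True if the ray plugin is enabled and
    the ray task type is routed to it."""
    plugins = config.get("tasks", {}).get("task-plugins", {})
    enabled = plugins.get("enabled-plugins", [])
    defaults = plugins.get("default-for-task-types", {})
    return "ray" in enabled and defaults.get("ray") == "ray"


print(ray_plugin_enabled(PROPELLER_CONFIG))
```

A check like this only confirms the config text; it does not confirm that propeller was restarted with it.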

Padma Priya M

12/09/2022, 12:50 PM
Yes, the Ray plugin is enabled.

Kevin Su

12/09/2022, 6:13 PM
Is there any error in the KubeRay operator?

Padma Priya M

12/12/2022, 5:05 AM
Not sure. How do I check whether it is working correctly?

Kevin Su

12/12/2022, 8:23 AM
kubectl logs <kuberay-operator> -n ray-system
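Since `<kuberay-operator>` stands for the operator pod's name, a first step is to list the pods in the namespace and then fetch logs for the matching one. The sketch below only builds and runs those two `kubectl` invocations from Python. The `ray-system` namespace comes from the command above; the helper names, the `--tail` value, and the assumption that the pod name contains `kuberay-operator` are illustrative.

```python
import subprocess
from typing import List


def list_pods_command(namespace: str = "ray-system") -> List[str]:
    """Build the kubectl command that lists pod names in a namespace."""
    return ["kubectl", "get", "pods", "-n", namespace, "-o", "name"]


def logs_command(pod: str, namespace: str = "ray-system", tail: int = 200) -> List[str]:
    """Build the kubectl command that prints the last `tail` log lines of a pod."""
    return ["kubectl", "logs", pod, "-n", namespace, f"--tail={tail}"]


def run(cmd: List[str]) -> str:
    """Run a command and return its stdout, raising on a non-zero exit."""
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout


if __name__ == "__main__":
    try:
        # `-o name` yields entries such as pod/<name>, which `kubectl logs` accepts.
        for pod in run(list_pods_command()).split():
            if "kuberay-operator" in pod:  # assumption about the pod's name
                print(run(logs_command(pod)))
    except (FileNotFoundError, subprocess.CalledProcessError) as err:
        print(f"kubectl call failed: {err}")
```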

Padma Priya M

12/12/2022, 9:24 AM

Kevin Su

12/12/2022, 7:31 PM
Have you installed an ingress controller? If not, it will cause an error in KubeRay, because KubeRay uses the ingress controller to create a new ingress route for the RayJob.

Padma Priya M

12/13/2022, 9:32 AM
Yes, an ingress controller is installed in the setup.

Kevin Su

12/13/2022, 8:43 PM
@Padma Priya M do you have a couple of minutes to hop on a call?

Padma Priya M

12/14/2022, 4:22 AM
Sure, please let me know what times work for you.

Kevin Su

12/14/2022, 7:31 PM
Maybe sometime between 9 AM and 12 noon your time.

Padma Priya M

12/18/2022, 6:44 AM
Sorry for the inconvenience, @Kevin Su. We were having a live demo, so we couldn't work on the setup. Will the same time tomorrow work for you?

Kevin Su

12/18/2022, 6:47 AM
No worries, yes, ping me tomorrow when you are available

Padma Priya M

12/19/2022, 1:49 PM
Hi, after the Helm upgrade I am able to see the worker pods getting created. But the issue now is that the task stays queued for a long time and does not start. The pod fails with
The node was low on resource: ephemeral-storage
and it then tries to start a new pod, even though we have enough ephemeral storage on the instance.
The Docker image that we are trying to pull is nearly 10 GB. Will that be an issue? Shall we connect tomorrow morning at 9 AM my time? Can you confirm where to connect, Slack or Google Meet?
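For context on the eviction message above: the kubelet evicts pods when free space on the node filesystem drops below a threshold, and the documented default hard thresholds are `nodefs.available<10%` and `imagefs.available<15%`. A large image can push a node under that line even when the disk looks roomy beforehand. The sketch below is a back-of-the-envelope check only. The disk sizes are made-up numbers, the function name is hypothetical, and it assumes the image occupies roughly its stated size on disk (unpacked layers may be larger than the compressed size).

```python
def below_eviction_threshold(
    disk_total_gib: float,
    disk_used_gib: float,
    image_gib: float,
    threshold: float = 0.10,  # kubelet default for nodefs.available
) -> bool:
    """Return True if pulling the image would leave less free disk
    than the eviction threshold (as a fraction of total disk)."""
    free_after_pull = disk_total_gib - disk_used_gib - image_gib
    return free_after_pull / disk_total_gib < threshold


# Hypothetical node: 50 GiB disk, 36 GiB already used, 10 GiB image.
print(below_eviction_threshold(50, 36, 10))   # 4 GiB free of 50 -> True
# Same usage on a hypothetical 100 GiB disk.
print(below_eviction_threshold(100, 36, 10))  # 54 GiB free of 100 -> False
```

If the first case resembles the actual node, a larger node disk or a smaller image would be the usual things to try; whether that applies here depends on the real disk usage.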

Kevin Su

12/19/2022, 6:31 PM
I’ll call you at 9 AM your time on Google Meet.