p
I'm using the Ray plugin with Flyte and I'm a little confused. When I launch a RayJob, I'm finding that the submitter is not connecting to the existing RayCluster created for the job, and is instead only running jobs on the submitter:
import os
import typing

import ray
from flytekit import Resources, task
from flytekitplugins.ray import HeadNodeConfig, RayJobConfig, WorkerNodeConfig


@ray.remote
def f():
    # Print and return the hostname of the node the task actually ran on.
    val = os.uname().nodename
    print(val)
    return val


@task(
    container_image=custom_image,
    task_config=RayJobConfig(
        head_node_config=HeadNodeConfig(),
        worker_node_config=[
            WorkerNodeConfig(group_name="ray-group", replicas=2)
        ],
    ),
    requests=Resources(cpu="12", mem="64Gi", gpu="1"),
)
def ray_task() -> typing.List[str]:
    futures = [f.remote() for _ in range(100)]
    return ray.get(futures)
2025-02-18 19:09:07,996 INFO worker.py:1832 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265
2025-02-18 19:09:08,000 INFO packaging.py:574 -- Creating a file package for local module '/root'.
2025-02-18 19:09:08,001 INFO packaging.py:366 -- Pushing file package 'gcs://_ray_pkg_4a190130c7bd83a1.zip' (0.00MiB) to Ray cluster...
2025-02-18 19:09:08,002 INFO packaging.py:379 -- Successfully pushed file package 'gcs://_ray_pkg_4a190130c7bd83a1.zip'.
(f pid=1181) abf8s2dxg8jrdfcs8bd8-raydevraytask-0-hmqgh
(f pid=1184) abf8s2dxg8jrdfcs8bd8-raydevraytask-0-hmqgh
(f pid=1180) abf8s2dxg8jrdfcs8bd8-raydevraytask-0-hmqgh
(f pid=1186) abf8s2dxg8jrdfcs8bd8-raydevraytask-0-hmqgh
(f pid=1182) abf8s2dxg8jrdfcs8bd8-raydevraytask-0-hmqgh
[... ~100 lines like the above: every invocation of f printed the same hostname, abf8s2dxg8jrdfcs8bd8-raydevraytask-0-hmqgh, which is the submitter pod ...]
When I try to set RAY_ADDRESS on the submitter it says I don't have the node config to do that, and if I stop the submitter's local Ray server and connect to the existing cluster instead, the pod gets killed by a probe.
I can work around this for now with something like ray start --address=..., but my assumption based on the provided examples was that this shouldn't be necessary.
c
When running remotely, the Flyte Ray plugin requires the Ray operator (KubeRay), which stands up the pods for the cluster. It looks like you're running this task locally, so everything is probably being submitted to an in-process Ray instance.
If you want to submit to an existing cluster I'd read this: https://github.com/flyteorg/flyte/issues/5877
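As a minimal sketch (not from this thread), one way to confirm which Ray runtime the driver actually connected to is to list the nodes from inside the task: a KubeRay cluster with a head and two workers should report several alive nodes, while a local in-process fallback reports just one (the submitter pod itself).
import os

import ray

# Minimal sketch: distinguish a real RayCluster from a local in-process instance.
if not ray.is_initialized():
    ray.init()  # inside a Flyte Ray task the plugin/job runner normally handles this

alive = [n for n in ray.nodes() if n["Alive"]]
for node in alive:
    print(node["NodeManagerHostname"], node.get("Resources", {}))
print("alive nodes:", len(alive), "driver hostname:", os.uname().nodename)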
p
I am running this on the cluster, and have installed kuberay and verified that all of the raycluster pods are starting up
f
@purple-father-70173 did you install flytekitplugins-ray in the image?
have you enabled the ray backend plugin in propeller config?
c
@purple-father-70173 I'm curious to see what the ray custom resources as well as the submitter pods look like on the k8s cluster when the ray task is running. kuberay will unconditionally set the dashboard address on the submitter pod: https://github.com/ray-project/kuberay/blob/master/ray-operator/controllers/ray/rayjob_controller.go#L487-L532
p
@freezing-airport-6809 yes and yes
@clean-glass-36808 when I get back to my laptop I can share one of the submitter configs. The dashboard address is being set correctly, it's just that ray is not picking it up
c
That, plus pod logs labeled with which pod they came from, would be helpful.
p
The pod logs I sent above are from the submitter; here's the submitter pod spec:
apiVersion: v1
kind: Pod
metadata:
  annotations:
    operator.1password.io/status: injected
  creationTimestamp: "2025-02-18T19:08:56Z"
  generateName: abf8s2dxg8jrdfcs8bd8-raydevraytask-0-
  labels:
    batch.kubernetes.io/controller-uid: 7bd62b58-f6ca-4d8a-9c77-02b7ac8d2708
    batch.kubernetes.io/job-name: abf8s2dxg8jrdfcs8bd8-raydevraytask-0
    controller-uid: 7bd62b58-f6ca-4d8a-9c77-02b7ac8d2708
    domain: development
    execution-id: abf8s2dxg8jrdfcs8bd8
    flyte-pod: "true"
    interruptible: "false"
    job-name: abf8s2dxg8jrdfcs8bd8-raydevraytask-0
    node-id: raydevraytask
    project: flytesnacks
    shard-key: "8"
    task-name: ray-dev-ray-task
    workflow-name: flytegen-ray-dev-ray-task
  name: abf8s2dxg8jrdfcs8bd8-raydevraytask-0-hmqgh
  namespace: fl97
  ownerReferences:
  - apiVersion: batch/v1
    blockOwnerDeletion: true
    controller: true
    kind: Job
    name: abf8s2dxg8jrdfcs8bd8-raydevraytask-0
    uid: 7bd62b58-f6ca-4d8a-9c77-02b7ac8d2708
  resourceVersion: "144746380"
  uid: a3c2a5f2-1426-412d-9ce8-67cc6f733ab7
spec:
  affinity: {}
  containers:
  - args:
    - pyflyte-fast-execute
    - --additional-distribution
    - s3://flyte/flytesnacks/development/VYXT4JISOSHJE4TPC7HGG63MYE======/faste62fbecc1ad336d6b69b63d0ddb08673.tar.gz
    - --dest-dir
    - .
    - --
    - pyflyte-execute
    - --inputs
    - s3://flyte/metadata/propeller/flytesnacks-development-abf8s2dxg8jrdfcs8bd8/raydevraytask/data/inputs.pb
    - --output-prefix
    - s3://flyte/metadata/propeller/flytesnacks-development-abf8s2dxg8jrdfcs8bd8/raydevraytask/data/0
    - --raw-output-data-prefix
    - s3://flyte/data/2e/abf8s2dxg8jrdfcs8bd8-raydevraytask-0
    - --checkpoint-path
    - s3://flyte/data/2e/abf8s2dxg8jrdfcs8bd8-raydevraytask-0/_flytecheckpoints
    - --prev-checkpoint
    - '""'
    - --resolver
    - flytekit.core.python_auto_container.default_task_resolver
    - --
    - task-module
    - ray_dev
    - task-name
    - ray_task
    command:
    - /op/bin/op
    - run
    - --
    - /op/bin/op
    - run
    - --
    env:
    - name: OP_SERVICE_ACCOUNT_TOKEN
      valueFrom:
        secretKeyRef:
          key: token
          name: op-service-account
    - name: FLYTE_INTERNAL_EXECUTION_WORKFLOW
      value: flytesnacks:development:.flytegen.ray_dev.ray_task
    - name: FLYTE_INTERNAL_EXECUTION_ID
      value: abf8s2dxg8jrdfcs8bd8
    - name: FLYTE_INTERNAL_EXECUTION_PROJECT
      value: flytesnacks
    - name: FLYTE_INTERNAL_EXECUTION_DOMAIN
      value: development
    - name: FLYTE_ATTEMPT_NUMBER
      value: "0"
    - name: FLYTE_INTERNAL_TASK_PROJECT
      value: flytesnacks
    - name: FLYTE_INTERNAL_TASK_DOMAIN
      value: development
    - name: FLYTE_INTERNAL_TASK_NAME
      value: ray_dev.ray_task
    - name: FLYTE_INTERNAL_TASK_VERSION
      value: ieQ9YgRouHsQLaDuyamylw
    - name: FLYTE_INTERNAL_PROJECT
      value: flytesnacks
    - name: FLYTE_INTERNAL_DOMAIN
      value: development
    - name: FLYTE_INTERNAL_NAME
      value: ray_dev.ray_task
    - name: FLYTE_INTERNAL_VERSION
      value: ieQ9YgRouHsQLaDuyamylw
    - name: _F_L_MIN_SIZE_MB
      value: "10"
    - name: _F_L_MAX_SIZE_MB
      value: "1000"
    - name: FLYTE_AWS_ENDPOINT
      value: http://10.141.3.3:9000
    - name: FLYTE_AWS_ACCESS_KEY_ID
      value: minio
    - name: FLYTE_AWS_SECRET_ACCESS_KEY
      value: miniostorage
    - name: PYTHONUNBUFFERED
      value: "1"
    - name: RAY_DASHBOARD_ADDRESS
      value: dfcs8bd8-raydevraytask-0-raycluster-dg9jg-head-svc.fl97.svc.cluster.local:8265
    - name: RAY_JOB_SUBMISSION_ID
      value: abf8s2dxg8jrdfcs8bd8-raydevraytask-0-tkdrl
    - name: OP_INTEGRATION_NAME
      value: 1Password Kubernetes Webhook
    - name: OP_INTEGRATION_ID
      value: K8W
    - name: OP_INTEGRATION_BUILDNUMBER
      value: "1000101"
    image: 579102688835.dkr.ecr.us-east-1.amazonaws.com/ssi:IaMZmB8OWMWoYdNNyJ0L0w
    imagePullPolicy: Always
    name: abf8s2dxg8jrdfcs8bd8-raydevraytask-0
    ports:
    - containerPort: 8080
      name: http
      protocol: TCP
    resources:
      limits:
        cpu: "12"
        memory: 64Gi
        nvidia.com/gpu: "1"
      requests:
        cpu: "12"
        memory: 64Gi
        nvidia.com/gpu: "1"
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: FallbackToLogsOnError
    volumeMounts:
    - mountPath: /shared
      name: utility-volume
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-bh9db
      readOnly: true
    - mountPath: /op/bin/
      name: op-bin
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  imagePullSecrets:
  - name: ecr-ssi
  initContainers:
  - command:
    - sh
    - -c
    - cp /usr/local/bin/op /op/bin/
    image: 1password/op:2
    imagePullPolicy: IfNotPresent
    name: copy-op-bin
    resources: {}
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /op/bin/
      name: op-bin
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-bh9db
      readOnly: true
  nodeName: fl97-hgx-04
  nodeSelector:
    nvidia.com/gpu.present: "true"
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Never
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: dev
  serviceAccountName: dev
  terminationGracePeriodSeconds: 30
  tolerations:
  - key: nvidia.com/gpu
    value: "true"
  - effect: NoSchedule
    key: nvidia.com/gpu
    operator: Equal
    value: "true"
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - name: utility-volume
    persistentVolumeClaim:
      claimName: pvc-utility-exascaler
  - name: kube-api-access-bh9db
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
  - emptyDir:
      medium: Memory
    name: op-bin
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2025-02-18T19:09:15Z"
    status: "False"
    type: PodReadyToStartContainers
  - lastProbeTime: null
    lastTransitionTime: "2025-02-18T19:08:59Z"
    reason: PodCompleted
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2025-02-18T19:09:13Z"
    reason: PodCompleted
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2025-02-18T19:09:13Z"
    reason: PodCompleted
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2025-02-18T19:08:56Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: containerd://dc54b94390650993c552f0f06bb5353e35cec9349e88c88319de520313fd9bf9
    image: 579102688835.dkr.ecr.us-east-1.amazonaws.com/ssi:IaMZmB8OWMWoYdNNyJ0L0w
    imageID: 579102688835.dkr.ecr.us-east-1.amazonaws.com/ssi@sha256:b764f67ebdef3a6a972f7fb5f6657c7cad073ef06ea0de300c2a880b5857c891
    lastState: {}
    name: abf8s2dxg8jrdfcs8bd8-raydevraytask-0
    ready: false
    restartCount: 0
    started: false
    state:
      terminated:
        containerID: containerd://dc54b94390650993c552f0f06bb5353e35cec9349e88c88319de520313fd9bf9
        exitCode: 0
        finishedAt: "2025-02-18T19:09:13Z"
        reason: Completed
        startedAt: "2025-02-18T19:09:00Z"
  hostIP: 10.141.1.4
  hostIPs:
  - ip: 10.141.1.4
  initContainerStatuses:
  - containerID: containerd://737cca74705887e5efb4e0ca9e03acf029244b51dceb821c511bcb342b3f2e35
    image: docker.io/1password/op:2
    imageID: docker.io/1password/op@sha256:e7b4dcc8df09659096cc7b7dfbeb6119eb49c2f01a5083d4c477ac5f9a23413d
    lastState: {}
    name: copy-op-bin
    ready: true
    restartCount: 0
    started: false
    state:
      terminated:
        containerID: containerd://737cca74705887e5efb4e0ca9e03acf029244b51dceb821c511bcb342b3f2e35
        exitCode: 0
        finishedAt: "2025-02-18T19:08:59Z"
        reason: Completed
        startedAt: "2025-02-18T19:08:59Z"
  phase: Succeeded
  podIP: 10.42.4.164
  podIPs:
  - ip: 10.42.4.164
  qosClass: Burstable
  startTime: "2025-02-18T19:08:58Z"
I'm using the 1password secrets injector, so that's why the command is a bit weird
c
What is this? /op/bin/op
p
configuration:
  ..
  inline:
    ...
    tasks:
      task-plugins:
        enabled-plugins:
          - container
          - sidecar
          - k8s-array
          - pytorch
          - ray
        default-for-task-types:
          - container: container
          - container_array: k8s-array
          - pytorch: pytorch
          - ray: ray
    plugins:
      ...
      ray:
        ttlSecondsAfterFinished: 3600
rbac:
  extraRules:
    - apiGroups:
        - kubeflow.org
      resources:
        - pytorchjobs
      verbs:
        - create
        - delete
        - get
        - list
        - patch
        - update
        - watch
    - apiGroups:
        - "<http://ray.io|ray.io>"
      resources:
        - rayjobs
      verbs:
        - create
        - delete
        - get
        - list
        - patch
        - update
        - watch
c
Your container command should be ray job submit
p
The 1Password secrets injector. It's this; I can also share my k8s pod template
c
Yeah I think this is some sort of issue with the pod template overriding something
p
apiVersion: v1
kind: PodTemplate
metadata:
  name: default
  namespace: fl97
template:
  metadata:
    labels:
      flyte-pod: "true"
  spec:
    imagePullSecrets:
      - name: ecr-ssi
    nodeSelector:
      nvidia.com/gpu.present: "true"
    serviceAccountName: dev
    containers:
      - name: default
        image: ghcr.io/flyteorg/flytekit:flyteinteractive-latest
        command: ["/op/bin/op", "run", "--"]
        env:
          - name: OP_SERVICE_ACCOUNT_TOKEN
            valueFrom:
              secretKeyRef:
                name: op-service-account
                key: token
        ports:
          - name: http
            containerPort: 8080
            protocol: TCP
        imagePullPolicy: Always
        volumeMounts:
          - name: utility-volume
            mountPath: /shared
    volumes:
      - name: utility-volume
        persistentVolumeClaim:
          claimName: pvc-utility-exascaler
    tolerations:
      - key: nvidia.com/gpu
        value: "true"
It sounds like we have some clash with the command
well, I need a command to use the secrets injector.
We have used 1pw and use akeyless now, but in both cases we used the Flyte secrets integrations to load secrets that were loaded in via ExternalSecret. Not sure about with ray tasks, I'd have to look
I am not familiar with how the secrets injector works but yeah that seems like the issue
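For reference, a rough sketch (not from this thread) of the Flyte-native secrets path mentioned above, assuming the secret has already been synced into the cluster (e.g. via ExternalSecret) under a group/key the Flyte secret webhook can resolve; the group and key names here are placeholders:
import flytekit
from flytekit import Secret, task


# Hypothetical group/key names, for illustration only.
@task(secret_requests=[Secret(group="op-service-account", key="token")])
def uses_secret() -> str:
    # Flyte's secret webhook mounts the secret into the pod; no custom command
    # wrapper (and hence no PodTemplate `command` override) is needed.
    token = flytekit.current_context().secrets.get("op-service-account", "token")
    return "token loaded" if token else "token missing"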
p
So if I'm understanding this correctly, I should not use a command in my PodTemplate because it specifically overrides the RayJobConfig?
I don't have this issue with PyTorch or any kind of Python task type
c
I am fairly confident that is the issue, yeah. I'm not familiar with how the PyTorch plugin works, maybe @freezing-airport-6809 knows. I think for python tasks there is no command set, so it probably just works without conflict
p
I think I can adjust my template to make this work, now that I know it's supposed to be ray job submit
c
Well, it's not just ray job submit:
command:
              - ray
              - job
              - submit
              - '--address'
              - http://a4bfkvblbc4ljtmzp6ww-n0-0-head-svc.flytesnacks-production.svc.cluster.local:8265
              - '--runtime-env-json'
              - '{"env_vars":{"PATH":"/run/flyte/bin:${PATH}"},"pip":["numpy","pandas"]}'
              - '--'
              - pyflyte-execute
              - '--inputs'
...
              - '--prev-checkpoint'
              - ''
              - '--resolver'
              - site-packages.flytekit.core.python_auto_container.default_task_resolver
              - '--'
              - task-module
              - ray_example
              - task-name
              - ray_task
p
Right, removing the 1password secrets injector requirement and removing the command so it's populated correctly.
idk why these couldn't just all be args, maybe that's an issue I need to file
c
Well, all the pyflyte-execute stuff is also defined as args for me. I don't think the Ray plugin is used in a production sense much, so there's probably lots of cruft.
Not sure if you realized, but you'll want to put that GPU resource request onto the worker/head nodes and not the submitter once this is all resolved, or you'll waste a GPU on the submitter
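A sketch of that adjustment, assuming a flytekitplugins-ray version where HeadNodeConfig/WorkerNodeConfig accept per-node-group requests/limits (the capability added by the PRs referenced later in the thread; older releases may not have these parameters):
import typing

import ray
from flytekit import Resources, task
from flytekitplugins.ray import HeadNodeConfig, RayJobConfig, WorkerNodeConfig

# Reuses the remote function f and custom_image from the snippet at the top of the thread.


@task(
    container_image=custom_image,
    task_config=RayJobConfig(
        head_node_config=HeadNodeConfig(requests=Resources(cpu="2", mem="8Gi")),
        worker_node_config=[
            WorkerNodeConfig(
                group_name="ray-group",
                replicas=2,
                # Assumption: `requests` is supported here, so the GPU lands on
                # the worker pods instead of the submitter.
                requests=Resources(cpu="12", mem="64Gi", gpu="1"),
            )
        ],
    ),
    # The task-level request now only sizes the submitter pod, so keep it small.
    requests=Resources(cpu="1", mem="2Gi"),
)
def ray_task_gpu_on_workers() -> typing.List[str]:
    futures = [f.remote() for _ in range(100)]
    return ray.get(futures)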
p
Yup, just trying to get any functionality
Thanks for your help! It's very appreciated
f
@clean-glass-36808 re: "Well, all the pyflyte-execute stuff is also defined as args for me. I don't think the Ray plugin is used in a production sense much, so there's probably lots of cruft"
WDYM by this? We do use it in production at scale
I will have to read through this to figure out what's happening, this is a long thread and I am confused
c
I didn't think the Ray plugin was used in a production sense since the Ray cluster pods couldn't be configured independently of the submitter until I landed those PRs. That seemed like a non-negotiable for running GPU workloads unless you're running tons of replicas with small requests on each. I couldn't imagine doing a multi-day training run and burning an H100 GPU just for a submitter pod to sit around and block. Maybe y'all have CPU-based workloads you run at scale and it's not as big a deal, or you're reusing a permanent Ray cluster (but even that has UX issues)?
p
The solution for now is to just remove command from my PodTemplate. I have it working now.
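As a rough sketch of the shape that works (an assumption, not the exact template from this thread): keep the env var, volume, and toleration customizations but leave command unset, so the Ray plugin's generated ray job submit entrypoint on the submitter pod is not overridden. The same idea can also be expressed per-task with flytekit's PodTemplate:
from flytekit import PodTemplate, task
from kubernetes.client import (
    V1Container,
    V1EnvVar,
    V1EnvVarSource,
    V1PodSpec,
    V1SecretKeySelector,
)

# Sketch only: customizations go on the "primary" container, but `command` is
# deliberately left unset so plugin-generated entrypoints (e.g. ray job submit)
# survive the merge.
ray_safe_template = PodTemplate(
    primary_container_name="primary",
    pod_spec=V1PodSpec(
        service_account_name="dev",
        containers=[
            V1Container(
                name="primary",
                env=[
                    V1EnvVar(
                        name="OP_SERVICE_ACCOUNT_TOKEN",
                        value_from=V1EnvVarSource(
                            secret_key_ref=V1SecretKeySelector(
                                name="op-service-account", key="token"
                            )
                        ),
                    )
                ],
            )
        ],
    ),
)


# Attach task_config=RayJobConfig(...) as in the earlier snippets for an actual Ray task.
@task(pod_template=ray_safe_template)
def some_task() -> None:
    ...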