# flyte-support
m
Hi all, I'm experiencing an issue with subworkflows hanging if they are running tasks that spin up ray clusters. My workflow looks like this:
```python
import ray
from flytekit import dynamic, task, workflow

@dynamic
def batched_workflow(...):
    # launch one subworkflow per repeat
    for n in repeats:
        sub_workflow(...)

@workflow
def sub_workflow(...):
    flyte_ray_task(...)

# RAY_JOB_CONFIG is a RayJobConfig for the flytekit Ray plugin (defined elsewhere)
@task(task_config=RAY_JOB_CONFIG)
def flyte_ray_task(...):
    # submit work to the Ray cluster that this task spins up
    ray.get([ray_fn.remote(...)])

@ray.remote
def ray_fn(...):
    ...
```
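(For context, `RAY_JOB_CONFIG` is a `RayJobConfig` from `flytekitplugins.ray`. Its actual contents don't matter much here; roughly it looks like the sketch below, where the group name and replica count are placeholders rather than the real values.)
```python
from flytekitplugins.ray import HeadNodeConfig, RayJobConfig, WorkerNodeConfig

# Placeholder config; the real head/worker settings live elsewhere in my code.
RAY_JOB_CONFIG = RayJobConfig(
    head_node_config=HeadNodeConfig(),
    worker_node_config=[WorkerNodeConfig(group_name="ray-workers", replicas=2)],
)
```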
When launching this nested `batched_workflow`, it hangs once the subworkflow gets to `flyte_ray_task`. No pods for the Ray cluster are created, as would normally happen when `sub_workflow` is launched in isolation. Inspecting the events on the RayCluster object that does get created, I see the following:
```
Warning  FailedToCreateIngress  2m14s (x25 over 124m)  raycluster-controller
Failed creating ingress raycluster/qkdzqy-0-dn0-0-dn2-0-raycluster-f5xjz-head-ingress,
Ingress.networking.k8s.io "qkdzqy-0-dn0-0-dn2-0-raycluster-f5xjz-head-ingress" is invalid:
metadata.labels: Invalid value: "ajct5z8clztdl9vgb7gh-fxqkdzqy-0-dn0-0-dn2-0-raycluster-f5xjz-head":
must be no more than 63 characters
```
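For what it's worth, the rejected label value from that event is 65 characters, just over the limit:
```python
# Label value copied verbatim from the event above.
label = "ajct5z8clztdl9vgb7gh-fxqkdzqy-0-dn0-0-dn2-0-raycluster-f5xjz-head"
print(len(label))  # 65 -> exceeds the Kubernetes 63-character label limit
```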
So it looks like, because of the nesting, Flyte generates a very long name for the Ray cluster, and the label derived from it exceeds the Kubernetes 63-character limit. Does anyone know a workaround for this?
g
I think that label is created by the KubeRay operator.
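Not a confirmed fix, but if the length is the issue, one thing that might be worth trying is flattening the nesting so the Ray task is called directly from the `@dynamic` workflow; that removes one `dn*-0` segment from the generated names and may bring the label back under 63 characters. Rough sketch:
```python
@dynamic
def batched_workflow(...):
    for n in repeats:
        # Call the Ray task directly instead of wrapping it in sub_workflow,
        # so the generated node/cluster name has one less nesting segment.
        flyte_ray_task(...)
```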