# flyte-support
m
Hi all, I'm experiencing an issue with subworkflows hanging if they are running tasks that spin up ray clusters. My workflow looks like this:
```python
import ray
from flytekit import dynamic, task, workflow

@dynamic
def batched_workflow(...):
    # launch one subworkflow per repeat
    for n in repeats:
        sub_workflow(...)

@workflow
def sub_workflow(...):
    flyte_ray_task(...)

# RAY_JOB_CONFIG is a RayJobConfig for the flytekit Ray plugin (defined elsewhere)
@task(task_config=RAY_JOB_CONFIG)
def flyte_ray_task(...):
    # submit work to the Ray cluster that this task spins up
    ray.get([ray_fn.remote(...)])

@ray.remote
def ray_fn(...):
    ...
```
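(For context, `RAY_JOB_CONFIG` is a `RayJobConfig` from `flytekitplugins.ray`. Its actual contents don't matter much here; roughly it looks like the sketch below, where the group name and replica count are placeholders rather than the real values.)
```python
from flytekitplugins.ray import HeadNodeConfig, RayJobConfig, WorkerNodeConfig

# Placeholder config; the real head/worker settings live elsewhere in my code.
RAY_JOB_CONFIG = RayJobConfig(
    head_node_config=HeadNodeConfig(),
    worker_node_config=[WorkerNodeConfig(group_name="ray-workers", replicas=2)],
)
```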
When launching this nested `batched_workflow`, it hangs once the subworkflow gets to `flyte_ray_task`. No pods for the Ray cluster are created, as would normally happen when `sub_workflow` is launched in isolation. Inspecting the events on the RayCluster object that does get created, I see the following:
```
Warning  FailedToCreateIngress  2m14s (x25 over 124m)  raycluster-controller
Failed creating ingress raycluster/qkdzqy-0-dn0-0-dn2-0-raycluster-f5xjz-head-ingress,
Ingress.networking.k8s.io "qkdzqy-0-dn0-0-dn2-0-raycluster-f5xjz-head-ingress" is invalid:
metadata.labels: Invalid value: "ajct5z8clztdl9vgb7gh-fxqkdzqy-0-dn0-0-dn2-0-raycluster-f5xjz-head":
must be no more than 63 characters
```
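For what it's worth, the rejected label value from that event is 65 characters, just over the limit:
```python
# Label value copied verbatim from the event above.
label = "ajct5z8clztdl9vgb7gh-fxqkdzqy-0-dn0-0-dn2-0-raycluster-f5xjz-head"
print(len(label))  # 65 -> exceeds the Kubernetes 63-character label limit
```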
So it looks like, because of the nesting, Flyte generates a very long name for the Ray cluster, and the label derived from it exceeds the Kubernetes 63-character limit. Does anyone know a workaround for this?
g
I think that label is created by the KubeRay operator.
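Not a confirmed fix, but if the length is the issue, one thing that might be worth trying is flattening the nesting so the Ray task is called directly from the `@dynamic` workflow; that removes one `dn*-0` segment from the generated names and may bring the label back under 63 characters. Rough sketch:
```python
@dynamic
def batched_workflow(...):
    for n in repeats:
        # Call the Ray task directly instead of wrapping it in sub_workflow,
        # so the generated node/cluster name has one less nesting segment.
        flyte_ray_task(...)
```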