I've got a workflow in a hung state. It has a dyna...
# flyte-support
f
I've got a workflow in a hung state. It has a dynamic task with a sub-workflow. When I click into the sub-workflow, it shows as completed successfully, but the parent workflow still shows it as running. I've been spelunking through propeller logs this morning, and one big thing stands out so far: near the time the issues began, I started seeing "Failed to cast contentMD5 [] to string" in the logs, and it has continued at a fairly regular interval ever since it first appeared. The exec_id is that of the child workflow. (full log line in thread)
{
  "level": "warning",
  "json": {
    "exec_id": "fyxjcfpbgcop5i",
    "routine": "worker-4681",
    "ns": "<redact>",
    "wf": "<redact>",
    "src": "stow_store.go:235",
    "node": "n5",
    "res_ver": "7698959083"
  },
  "msg": "Failed to cast contentMD5 [] to string",
  "ts": "2024-08-21T04:53:44Z"
}
We've confirmed just now that a propeller restart seems to have gotten things un-stuck, so there's some in-process state that is problematic
I'll keep looking at logs for another proximate root cause
t
cc @flat-area-42876 just fyi
f
We continued to get that contentMD5 warning about 2x a minute for the 12 hours after the workflow got stuck. It cleared up after the restart as well.
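(Editor's sketch, not from the thread.) One way to check a rate like the "about 2x a minute" above is to bucket the warnings by minute. The helper below is a minimal sketch that assumes propeller logs arrive as one JSON object per line, with the same `msg`, `ts`, and `json.exec_id` fields as the log line posted earlier; the function name `count_warnings_per_minute` and its parameters are hypothetical, not part of Flyte.

```python
import json
from collections import Counter


def count_warnings_per_minute(lines, needle="Failed to cast contentMD5", exec_id=None):
    """Count log lines whose msg contains `needle`, keyed by (exec_id, minute).

    Assumes one JSON object per line, shaped like the log line in this thread.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip anything that is not a JSON log line
        if not isinstance(entry, dict):
            continue
        if needle not in str(entry.get("msg", "")):
            continue
        fields = entry.get("json")
        eid = fields.get("exec_id") if isinstance(fields, dict) else None
        if exec_id is not None and eid != exec_id:
            continue
        # "2024-08-21T04:53:44Z" -> "2024-08-21T04:53" (truncate to the minute)
        minute = str(entry.get("ts", ""))[:16]
        counts[(eid, minute)] += 1
    return counts
```

`lines` can be any iterable of strings, such as an open file handle over a saved log dump. A steady count of roughly 2 per minute for a single exec_id would match what was described here.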
f
What version of propeller are you running?
f
Looks like this was on 1.12.0
f
@famous-flag-22960 also wanted to double check, by subworkflow you don't mean launching an external workflow. Could you share a code snippet showing the structure of your workflow?
(There was a bug in propeller 1.12 with propagating child/external workflow state. The fix is available in 1.13.)
f
Sorry, have to piece these together. Looks like the core bit is structured roughly like this:
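from flytekit import LaunchPlan, dynamic, task, workflow

@task
def a() -> None:
    ...  # stand-in for the real task body

@workflow
def child_wf() -> None:
    a()

# Launch plan wrapping the child workflow (defined after child_wf so the name resolves)
LP = LaunchPlan.create("CHILD_LP", child_wf)

@dynamic
def parent() -> None:
    # Each call launches child_wf through its launch plan
    for i in range(10):
        LP()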
That bug sure does sound like it could be it. Lemme see about upgrading us in short order then. We've been hitting this a few times a week for a little bit now according to some of my team, so if this fixes it, we'll know soon enough
f
@famous-flag-22960 did upgrading your propeller version end up resolving this issue?
f
It looks like it has, or has at least decreased the frequency to a point where it's not getting escalated to me or clearly showing up in metrics