# flyte-support
f
Hi. I have some Flyte workflow executions that use map tasks. Sometimes these executions finish and show 'SUCCESS', but at the same time, when you look at the detailed task status, some task instances still show as 'Running'. Looking at the K8s pod status, all pods are complete. One particular characteristic of these workflows is that the max task parallelism is smaller than the number of task instances. Has anyone observed this kind of tracking problem?
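(For context, here is a minimal sketch of the workflow shape being described. Task and workflow names are hypothetical, and it assumes flytekit's map_task accepts a concurrency keyword as in recent releases; the concurrency cap is smaller than the number of mapped inputs.)

    from typing import List

    from flytekit import map_task, task, workflow


    @task
    def process_shard(shard: str) -> str:
        # stand-in for the real per-shard processing
        return shard.upper()


    @workflow
    def process_dataset(shards: List[str]) -> List[str]:
        # e.g. 100 shards but only 8 running at a time, so max task
        # parallelism is smaller than the number of task instances
        return map_task(process_shard, concurrency=8)(shard=shards)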
g
did you still see it running after refreshing the web page?
f
Hi @glamorous-carpet-83516
[screenshot attached: image.png]
Yes, even after more than a few days
g
cc @flat-area-42876 have you seen this issue before?
f
@glamorous-carpet-83516 yup - there's an issue and a potential fix waiting to be merged that's probably related to this.
🚀 1
g
awesome! thank you
f
@fierce-oil-47448 did you set min_successes or min_success_ratio for that array node map task?
f
@flat-area-42876 Yes, min success ratio is set
Would there be a workaround?
f
@fierce-oil-47448 was the min_success_ratio set below 1.0, and would you be able to determine whether those subtasks that were stuck in Running actually failed?
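(For reference, on the hypothetical sketch above, the setting under discussion would look like the line below; 1.0 requires every subtask to succeed, while a lower ratio lets the map task succeed even if some subtasks fail.)

    # same hypothetical workflow as above, now tolerating up to 10% failed subtasks
    map_task(process_shard, concurrency=8, min_success_ratio=0.9)(shard=shards)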
f
I checked all our configs
They are all 1.0
And the problematic run had no failures
f
hmm - then there's some other pathway where ArrayNode is dropping events, other than the one we noticed. Would you have access to the propeller logs from when those executions ran? A log line with the identifying metadata would be of interest.
Is this happening deterministically for those workflows? Would you have a sample workflow that could repro this? I'll try to repro this later today
f
It is happening frequently. The jobs are quite complex, though, so it's not easy to reduce them to a small repro.
Checking out the logs now.
@flat-area-42876 I don't see "event '%s' has already been sent" in the propeller logs
f
@flat-area-42876 Plenty of those
"json":{…}, "level":"warning", "msg":"Failed to record taskEvent, error [AlreadyExists: Event already exists, caused by [rpc error: code = AlreadyExists desc = have already recorded task execution phase RUNNING (version: 0) for {{{} [] [] nil} 0 [] resource_type:TASK project:"data-processing" domain:"production" name:"ml_platform.workflows.data_processing.workflow.map_single_dataset_shard_processor_a8f2749e939d42b24821878ff264334f-arraynode" version:"92d6cae91e777c5625fd32b918b953d5" node_id:"n0" execution_id{project"data-processing" domain:"production" name:"19ba1f38e6874902bd8"} 0}]]. Trying to record state: RUNNING. Ignoring this error!", "ts":"2024-07-02T221741Z"}
{"json":{…}, "level":"warning", "msg":"Failed to record taskEvent, error [AlreadyExists: Event already exists, caused by [rpc error: code = AlreadyExists desc = have already recorded task execution phase RUNNING (version: 0) for {{{} [] [] <nil>} 0 [] resource_type:TASK project:"data-processing" domain:"production" name:"ml_platform.workflows.data_lake_processing.workflow.map_single_dataset_shard_processor_43e2299725b86f6735550fdcb2663de8-arraynode" version:"ed0e759f46df42f2160fbe4f5f639d8d" node_id:"n2" execution_id{project"data-processing" domain:"production" name:"ee1ea94d1af04d2eaa4"} 0}]]. Trying to record state: RUNNING. Ignoring this error!", "ts":"2024-07-02T231548Z"}
f
@fierce-oil-47448 do all/most of those logs have "version: 0" in the message?
f
Out of the 139 events that I see in the last two weeks, most have non-zero versions.
@flat-area-42876 Only 11 have version: 0
Here is an example:
{"json":{…}, "level":"warning", "msg":"Failed to record taskEvent, error [AlreadyExists: Event already exists, caused by [rpc error: code = AlreadyExists desc = have already recorded task execution phase RUNNING (version: 107) for {{{} [] [] <nil>} 0 [] resource_type:TASK project:"data-processing" domain:"production" name:"ml_platform.workflows.data_lake_processing.workflow.map_single_dataset_shard_processor_051577342cda88558f6691693449ff7b-arraynode" version:"133a98c0040ade28c3f55ee9e087e0e1" node_id:"n2" execution_id:{project:"data-processing" domain:"production" name:"f371ad7b01e3424a974"} 0}]]. Trying to record state: RUNNING. Ignoring this error!", "ts":"2024-07-03T11:05:39Z"}
@flat-area-42876 Forgot to mention that we're using v1.12.1-rc0
We are planning to upgrade to v1.13
f
Thanks for the added info. I wasn't able to look deeper into this last night; I'll have time to prioritize it after wrapping up some other work today.
I believe this has to do with how we're incrementing task phase versions when we aggregate and then emit subtask events for ArrayNode.
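(To illustrate why that would matter - this is a toy sketch, not Flyte's actual code: the AlreadyExists warnings above suggest admin deduplicates task events by phase plus phase version, so if the aggregated ArrayNode event reuses the same version, later events carrying updated subtask phases are rejected and the UI never catches up.)

    # Toy model of dedup keyed on (phase, phase_version) -- not Flyte source code.
    recorded = set()

    def record_task_event(phase: str, phase_version: int) -> bool:
        key = (phase, phase_version)
        if key in recorded:
            # corresponds to the AlreadyExists warnings in the logs above;
            # the payload (including updated subtask phases) is dropped
            return False
        recorded.add(key)
        return True

    record_task_event("RUNNING", 0)  # first aggregated event is accepted
    record_task_event("RUNNING", 0)  # version not bumped -> rejected as a duplicate
    record_task_event("RUNNING", 1)  # bumping the version lets new subtask state through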
f
@flat-area-42876 We have been getting these errors:
error syncing 'data-processing-production/prathyush-katukojwala-a344f2cf267c48fba30': [READ_FAILED] failed to read data from dataDir [<gs://flyte-rawdata-423420/metadata/propeller/data-processing-production-prathyush-katukojwala-a344f2cf267c48fba30/n2/data/inputs.pb>]., caused by: path:<gs://flyte-rawdata-423420/metadata/propeller/data-processing-production-prathyush-katukojwala-a344f2cf267c48fba30/n2/data/inputs.pb>: [LIMIT_EXCEEDED] limit exceeded. 20.305138mb > 10mb. You can increase the limit by setting maxDownloadMBs.
and I am wondering if there is some relation between the tracking issues and this. We have fixed the limit exceeded issue and will monitor if the tracking problem goes away.
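(For anyone else hitting that LIMIT_EXCEEDED error: the limit named in the message normally lives in the storage section of the propeller configuration. Key names here follow the standard flytestdlib storage config; the exact location depends on your deployment/Helm values.)

    storage:
      limits:
        # default is 10; the inputs.pb in the error above was ~20 MB
        maxDownloadMBs: 50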
f
I'm skeptical that that's related to the subtask phase issue. Please let me know if that clears it up though - that would be a bigger bug as that error should bubble up to a failure. I haven't been able to repro the issue you're running into, with subtasks stuck in the Running phase after the workflow succeeded. Running a stress test overnight.
f
that would be a bigger bug as that error should bubble up to a failure.
@flat-area-42876 That error is definitely not bubbling up to a failure. I was only able to detect it by looking at the propeller logs to see if something was wrong, because the job was stuck in queued state.
f
@fierce-oil-47448
Because the job was stuck in queued state.
when a task fails with "failed to read data from dataDir...", did the workflow eventually fail and stop getting evaluated by propeller after running out of system retries? By "stuck in queued state", are you referring to the state shown in the UI? Also, for clarification: are the tasks that are failing with "failed to read data from dataDir..." part of the ArrayNode, or under another node in the same workflow where you're seeing ArrayNode subtasks not getting their phases updated correctly?
f
did the workflow eventually fail and stop getting evaluated by propeller after running out of system retries?
Not really. The node array got stuck in Queued state in one case, in the UI. In another case, the tracking of the individual tasks under the node array also stopped in the UI.
Also, for clarification: are the tasks that are failing with `failed to read data from dataDir...` part of the ArrayNode, or under another node in the same workflow where you're seeing ArrayNode subtasks not getting their phases updated correctly?
I don't see the tasks failing; I only get that error in the propeller logs. I never see the tasks as failed in the UI, and I think that's the bigger problem. My workflow has only a single map task and there are no nested tasks. It is workflow -> map task.
f
@fierce-oil-47448 thank you for the added context. A small follow-up: was there overlap between the workflow executions where the ArrayNode displayed as SUCCEEDED in the UI with subtask phases as RUNNING, and the workflow executions that had "failed to read data from dataDir..." errors?
The node array got stuck in Queued state in one case, in the UI
Did the workflow reach a terminal state in this case?
the tracking of the individual tasks under the node array also stopped in the UI
Did the ArrayNode reach a terminal state? Also, did the workflow reach a terminal state in this case?
f
A small follow-up: was there overlap between the workflow executions where the ArrayNode displayed as SUCCEEDED in the UI with subtask phases as RUNNING, and the workflow executions that had "failed to read data from dataDir..." errors?
Great question. No, I don't see those errors. However, there are a number of failures, such as:
We recently upgraded to the latest Flyte version. I'll monitor if we still see this issue.
> Did the workflow reach a terminal state in this case?
> Did the ArrayNode reach a terminal state? Also, did the workflow reach a terminal state in this case?
In both cases, no, it did not, but we aborted the workflow after a few hours, so I'm not sure whether it would have. It might be best to focus on the case where we have a SUCCEEDED job with array node task executions still in the Running state. After almost 10 days, I can still see this job in that state, and I have the full propeller logs (the screenshots shared above).
@flat-area-42876 With the latest version of Flyte, we still have this issue: the job succeeds, but the array node incorrectly shows some tasks in the Initialized state.
No more "failed to read data from dataDir" issues, as we raised that limit.
f
Thank you for the updates. I will be getting back to this in the next day - I've added this issue to our current sprint.
After almost 10 days, I can still see this job with that state and I have full propeller logs (the screenshots shared above).
This would be expected. Since propeller is no longer evaluating this workflow/the ArrayNode, there would not be any new events emitted to admin from propeller to update that state.
f
Thanks @flat-area-42876. Was there any update on it?
f
yup - we merged in a fix last week. Let me check how soon we can get a beta release out.
❤️ 1
@fierce-oil-47448 there were additional ArrayNode problems that caused some eventing issues. Just deploying those fixes to our fork now. Will get them merged into open source hopefully tomorrow.
❤️ 1
f
Is there a release expected soon?
f
there's a 1.13.1 release candidate. I believe 1.13.1 will be released very soon