# flyte-support
f
Hi. I have some Flyte workflow executions that use map tasks. Sometimes these executions finish and show 'SUCCESS', but at the same time, when you look at the detailed task status, some task instances still show as 'Running'. Looking at the K8s pod status, all pods are complete. One particular characteristic of these workflows is that the max task parallelism is smaller than the number of task instances. Has anyone observed this kind of tracking problem?
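(For context, here is a minimal sketch of the workflow shape being described. Task and workflow names are hypothetical, and it assumes flytekit's map_task accepts a concurrency keyword as in recent releases; the concurrency cap is smaller than the number of mapped inputs.)

    from typing import List

    from flytekit import map_task, task, workflow


    @task
    def process_shard(shard: str) -> str:
        # stand-in for the real per-shard processing
        return shard.upper()


    @workflow
    def process_dataset(shards: List[str]) -> List[str]:
        # e.g. 100 shards but only 8 running at a time, so max task
        # parallelism is smaller than the number of task instances
        return map_task(process_shard, concurrency=8)(shard=shards)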
g
did you still see it running after refreshing the web page?
f
Hi @glamorous-carpet-83516
[screenshot attached: image.png]
Yes, even after more than a few days
g
cc @flat-area-42876 have you seen this issue before?
f
@glamorous-carpet-83516 yup - there's an issue and a potential fix waiting to be merged that's probably related to this.
🚀 1
g
awesome! thank you
f
@fierce-oil-47448 did you set min_successes or min_success_ratio for that array node map task?
f
@flat-area-42876 Yes, min success ratio is set
Would there be a workaround?
f
@fierce-oil-47448 was the min_success_ratio set below 1.0, and would you be able to determine whether those subtasks that were stuck in Running actually failed?
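(For reference, on the hypothetical sketch above, the setting under discussion would look like the line below; 1.0 requires every subtask to succeed, while a lower ratio lets the map task succeed even if some subtasks fail.)

    # same hypothetical workflow as above, now tolerating up to 10% failed subtasks
    map_task(process_shard, concurrency=8, min_success_ratio=0.9)(shard=shards)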
f
I checked all our configs
They are all 1.0
And the problematic run had no failures
f
hmm - then there's some other pathway where ArrayNode is dropping events, other than the one we noticed. Would you have access to the propeller logs from when those executions ran? A log line with the identifying metadata would be of interest.
Is this happening deterministically for those workflows? Would you have a sample workflow that could repro this? I'll try to repro this later today
f
It is happening frequently. The jobs are quite complex, though, so it's not easy to reduce them to a small repro.
Checking out the logs now.
@flat-area-42876 I don't see "event '%s' has already been sent" in the propeller logs
f
@flat-area-42876 Plenty of those
"json":{…}, "level":"warning", "msg":"Failed to record taskEvent, error [AlreadyExists: Event already exists, caused by [rpc error: code = AlreadyExists desc = have already recorded task execution phase RUNNING (version: 0) for {{{} [] [] nil} 0 [] resource_type:TASK project:"data-processing" domain:"production" name:"ml_platform.workflows.data_processing.workflow.map_single_dataset_shard_processor_a8f2749e939d42b24821878ff264334f-arraynode" version:"92d6cae91e777c5625fd32b918b953d5" node_id:"n0" execution_id{project"data-processing" domain:"production" name:"19ba1f38e6874902bd8"} 0}]]. Trying to record state: RUNNING. Ignoring this error!", "ts":"2024-07-02T221741Z"}
{"json":{…}, "level":"warning", "msg":"Failed to record taskEvent, error [AlreadyExists: Event already exists, caused by [rpc error: code = AlreadyExists desc = have already recorded task execution phase RUNNING (version: 0) for {{{} [] [] <nil>} 0 [] resource_type:TASK project:"data-processing" domain:"production" name:"ml_platform.workflows.data_lake_processing.workflow.map_single_dataset_shard_processor_43e2299725b86f6735550fdcb2663de8-arraynode" version:"ed0e759f46df42f2160fbe4f5f639d8d" node_id:"n2" execution_id{project"data-processing" domain:"production" name:"ee1ea94d1af04d2eaa4"} 0}]]. Trying to record state: RUNNING. Ignoring this error!", "ts":"2024-07-02T231548Z"}
f
@fierce-oil-47448 do all/most of those logs have "version: 0" in the message?
f
Out of the 139 events that I see in the last two weeks, most have non-zero versions.
@flat-area-42876 Only 11 have version: 0
Here is an example:
{"json":{…}, "level":"warning", "msg":"Failed to record taskEvent, error [AlreadyExists: Event already exists, caused by [rpc error: code = AlreadyExists desc = have already recorded task execution phase RUNNING (version: 107) for {{{} [] [] <nil>} 0 [] resource_type:TASK project:"data-processing" domain:"production" name:"ml_platform.workflows.data_lake_processing.workflow.map_single_dataset_shard_processor_051577342cda88558f6691693449ff7b-arraynode" version:"133a98c0040ade28c3f55ee9e087e0e1" node_id:"n2" execution_id:{project:"data-processing" domain:"production" name:"f371ad7b01e3424a974"} 0}]]. Trying to record state: RUNNING. Ignoring this error!", "ts":"2024-07-03T11:05:39Z"}
@flat-area-42876 Forgot to mention that we're using v1.12.1-rc0
We are planning to upgrade to v1.13
f
Thanks for the added info. I wasn't able to look deeper into this last night; I'll have time to prioritize it after wrapping up some other work today.
I believe this has to do with how we're incrementing task phase versions when we aggregate and then emit subtask events for ArrayNode.
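(To illustrate why that would matter - this is a toy sketch, not Flyte's actual code: the AlreadyExists warnings above suggest admin deduplicates task events by phase plus phase version, so if the aggregated ArrayNode event reuses the same version, later events carrying updated subtask phases are rejected and the UI never catches up.)

    # Toy model of dedup keyed on (phase, phase_version) -- not Flyte source code.
    recorded = set()

    def record_task_event(phase: str, phase_version: int) -> bool:
        key = (phase, phase_version)
        if key in recorded:
            # corresponds to the AlreadyExists warnings in the logs above;
            # the payload (including updated subtask phases) is dropped
            return False
        recorded.add(key)
        return True

    record_task_event("RUNNING", 0)  # first aggregated event is accepted
    record_task_event("RUNNING", 0)  # version not bumped -> rejected as a duplicate
    record_task_event("RUNNING", 1)  # bumping the version lets new subtask state through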
f
@flat-area-42876 We have been getting these errors:
error syncing 'data-processing-production/prathyush-katukojwala-a344f2cf267c48fba30': [READ_FAILED] failed to read data from dataDir [<gs://flyte-rawdata-423420/metadata/propeller/data-processing-production-prathyush-katukojwala-a344f2cf267c48fba30/n2/data/inputs.pb>]., caused by: path:<gs://flyte-rawdata-423420/metadata/propeller/data-processing-production-prathyush-katukojwala-a344f2cf267c48fba30/n2/data/inputs.pb>: [LIMIT_EXCEEDED] limit exceeded. 20.305138mb > 10mb. You can increase the limit by setting maxDownloadMBs.
and I am wondering if there is some relation between the tracking issues and this. We have fixed the limit exceeded issue and will monitor if the tracking problem goes away.
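(For anyone else hitting that LIMIT_EXCEEDED error: the limit named in the message normally lives in the storage section of the propeller configuration. Key names here follow the standard flytestdlib storage config; the exact location depends on your deployment/Helm values.)

    storage:
      limits:
        # default is 10; the inputs.pb in the error above was ~20 MB
        maxDownloadMBs: 50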
f
I'm skeptical that that's related to the subtask phase issue. Please let me know if that clears it up though - that would be a bigger bug as that error should bubble up to a failure. I haven't been able to repro the issue you're running into, with subtasks stuck in the Running phase after the workflow succeeded. Running a stress test overnight.
f
that would be a bigger bug as that error should bubble up to a failure.
@flat-area-42876 That error is definitely not bubbling up to a failure. I was only able to detect it by looking at the propeller logs to see if something was wrong, because the job was stuck in queued state.
f
@fierce-oil-47448
Because the job was stuck in queued state.
when a task fails with "failed to read data from dataDir...", did the workflow eventually fail and stop getting evaluated by propeller after running out of system retries? By "stuck in queued state", are you referring to the state shown in the UI? Also, for clarification: are the tasks that are failing with "failed to read data from dataDir..." part of the ArrayNode, or under another node in the same workflow where you're seeing ArrayNode subtasks not getting their phases updated correctly?
f
did the workflow eventually fail and stop getting evaluated by propeller after running out of system retries?
Not really. The node array got stuck in Queued state in one case, in the UI. In another case, the tracking of the individual tasks under the node array also stopped in the UI.
Also, for clarification: are the tasks that are failing with `failed to read data from dataDir...` part of the ArrayNode, or under another node in the same workflow where you're seeing ArrayNode subtasks not getting their phases updated correctly?
I don't see the tasks failing; I only get that error in the propeller logs. I never see the tasks as failed in the UI, and I think that's the bigger problem. My workflow has only a single map task and there are no nested tasks. It is workflow -> map task.
f
@fierce-oil-47448 thank you for the added context. A small follow-up: was there overlap between the workflow executions where the ArrayNode displayed as SUCCEEDED in the UI with subtask phases as RUNNING, and the workflow executions that had "failed to read data from dataDir..." errors?
The node array got stuck in Queued state in one case, in the UI
Did the workflow reach a terminal state in this case?
the tracking of the individual tasks under the node array also stopped in the UI
Did the ArrayNode reach a terminal state? Also, did the workflow reach a terminal state in this case?
f
A small follow-up: was there overlap between the workflow executions where the ArrayNode displayed as SUCCEEDED in the UI with subtask phases as RUNNING, and the workflow executions that had "failed to read data from dataDir..." errors?
Great question. No, I don't see those errors. However, there are a number of failures, such as:
We recently upgraded to the latest Flyte version. I'll monitor if we still see this issue.
> Did the workflow reach a terminal state in this case?
> Did the ArrayNode reach a terminal state? Also, did the workflow reach a terminal state in this case?
In both cases, no, it did not, but we aborted the workflow after a few hours, so I'm not sure whether it would have. It might be best to focus on the case where we have a SUCCEEDED job with array node task executions still in the Running state. After almost 10 days, I can still see this job in that state, and I have the full propeller logs (the screenshots shared above).
@flat-area-42876 With the latest version of Flyte, we still have this issue: the job succeeds, but the array node incorrectly shows some tasks in the Initialized state.
No more "failed to read data from dataDir" issues, as we raised that limit.
f
Thank you for the updates. I will be getting back to this in the next day - I've added this issue to our current sprint.
After almost 10 days, I can still see this job with that state and I have full propeller logs (the screenshots shared above).
This would be expected. Since propeller is no longer evaluating this workflow/the ArrayNode, there would not be any new events emitted to admin from propeller to update that state.
f
Thanks @flat-area-42876. Was there any update on it?
f
yup - we merged in a fix last week. Let me check how soon we can get a beta release out.
❤️ 1
@fierce-oil-47448 there were additional ArrayNode problems that caused some eventing issues. Just deploying those fixes to our fork now. Will get them merged into open source hopefully tomorrow.
❤️ 1
f
Is there a release expected soon?
f
there's a 1.13.1 release candidate. I believe 1.13.1 will be released very soon