fierce-oil-47448 (07/04/2024, 1:37 AM)

glamorous-carpet-83516 (07/04/2024, 6:50 PM)

fierce-oil-47448 (07/07/2024, 3:48 AM)

fierce-oil-47448 (07/07/2024, 3:48 AM)

fierce-oil-47448 (07/07/2024, 3:48 AM)

glamorous-carpet-83516 (07/09/2024, 6:05 PM)

glamorous-carpet-83516 (07/09/2024, 6:23 PM)

flat-area-42876 (07/09/2024, 6:23 PM)

fierce-oil-47448 (07/09/2024, 6:29 PM)

fierce-oil-47448 (07/09/2024, 6:32 PM)

flat-area-42876 (07/09/2024, 6:32 PM)

fierce-oil-47448 (07/09/2024, 6:34 PM)

fierce-oil-47448 (07/09/2024, 6:35 PM)

fierce-oil-47448 (07/09/2024, 6:35 PM)

flat-area-42876 (07/09/2024, 6:46 PM)

flat-area-42876 (07/09/2024, 7:03 PM)

fierce-oil-47448 (07/09/2024, 7:15 PM)

fierce-oil-47448 (07/09/2024, 7:15 PM)

fierce-oil-47448 (07/09/2024, 7:21 PM)

flat-area-42876 (07/09/2024, 9:00 PM)

fierce-oil-47448 (07/09/2024, 9:44 PM)

fierce-oil-47448 (07/09/2024, 9:44 PM)

fierce-oil-47448 (07/09/2024, 9:47 PM)

flat-area-42876 (07/09/2024, 10:15 PM):
`version: 0` in the message?

fierce-oil-47448 (07/10/2024, 12:49 PM)

fierce-oil-47448 (07/10/2024, 12:49 PM)

fierce-oil-47448 (07/10/2024, 12:50 PM):
{"json":{…}, "level":"warning", "msg":"Failed to record taskEvent, error [AlreadyExists: Event already exists, caused by [rpc error: code = AlreadyExists desc = have already recorded task execution phase RUNNING (version: 107) for {{{} [] [] <nil>} 0 [] resource_type:TASK project:"data-processing" domain:"production" name:"ml_platform.workflows.data_lake_processing.workflow.map_single_dataset_shard_processor_051577342cda88558f6691693449ff7b-arraynode" version:"133a98c0040ade28c3f55ee9e087e0e1" node_id:"n2" execution_id:{project:"data-processing" domain:"production" name:"f371ad7b01e3424a974"} 0}]]. Trying to record state: RUNNING. Ignoring this error!", "ts":"2024-07-03T11:05:39Z"}
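
[Editor's note] For anyone triaging a similar report, a minimal sketch (plain Python over JSON-formatted propeller log lines like the one above, not Flyte tooling; the log file path is hypothetical) for pulling out the "Failed to record taskEvent ... AlreadyExists" warnings:

```python
import json
import sys

def find_event_recording_warnings(log_path: str):
    """Yield (timestamp, msg) for propeller warnings where admin rejected a
    task event as AlreadyExists (assumes one JSON object per log line)."""
    with open(log_path) as f:
        for line in f:
            line = line.strip()
            if not line.startswith("{"):
                continue  # skip any non-JSON log lines
            try:
                entry = json.loads(line)
            except json.JSONDecodeError:
                continue
            msg = entry.get("msg", "")
            if (entry.get("level") == "warning"
                    and "Failed to record taskEvent" in msg
                    and "AlreadyExists" in msg):
                yield entry.get("ts"), msg

if __name__ == "__main__":
    # Usage (hypothetical file name): python scan_propeller_log.py propeller.log
    for ts, msg in find_event_recording_warnings(sys.argv[1]):
        print(ts, msg[:120])
```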
fierce-oil-47448 (07/10/2024, 10:13 PM)

fierce-oil-47448 (07/10/2024, 10:13 PM)

flat-area-42876 (07/10/2024, 10:16 PM)

flat-area-42876 (07/10/2024, 10:19 PM)

fierce-oil-47448 (07/11/2024, 1:18 AM):
error syncing 'data-processing-production/prathyush-katukojwala-a344f2cf267c48fba30': [READ_FAILED] failed to read data from dataDir [gs://flyte-rawdata-423420/metadata/propeller/data-processing-production-prathyush-katukojwala-a344f2cf267c48fba30/n2/data/inputs.pb]., caused by: path:gs://flyte-rawdata-423420/metadata/propeller/data-processing-production-prathyush-katukojwala-a344f2cf267c48fba30/n2/data/inputs.pb: [LIMIT_EXCEEDED] limit exceeded. 20.305138mb > 10mb. You can increase the limit by setting maxDownloadMBs.
and I am wondering if there is some relation between the tracking issues and this. We have fixed the limit exceeded issue and will monitor if the tracking problem goes away.
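
[Editor's note] The LIMIT_EXCEEDED error above is the propeller storage client refusing to download an object larger than its configured cap (the message itself points at maxDownloadMBs; here the cap is 10 MB and the node's inputs.pb is ~20 MB). A hedged sketch, using the google-cloud-storage client with placeholder bucket/path values, for checking how large an offending inputs.pb actually is before raising the limit:

```python
from google.cloud import storage

def blob_size_mb(bucket_name: str, blob_path: str) -> float:
    """Return the size in MB of a GCS object, e.g. a node's inputs.pb."""
    client = storage.Client()
    blob = client.bucket(bucket_name).get_blob(blob_path)  # get_blob fetches metadata, incl. size
    if blob is None:
        raise FileNotFoundError(f"gs://{bucket_name}/{blob_path} not found")
    return blob.size / (1024 * 1024)

if __name__ == "__main__":
    # Placeholder values; substitute the bucket and key from the propeller error message.
    size = blob_size_mb(
        "my-flyte-metadata-bucket",
        "metadata/propeller/<execution-id>/n2/data/inputs.pb",
    )
    print(f"inputs.pb is {size:.1f} MB; the configured limit in the error above was 10 MB")
```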
flat-area-42876 (07/11/2024, 9:29 AM)

fierce-oil-47448 (07/11/2024, 7:32 PM):
> that would be a bigger bug as that error should bubble up to a failure.
@flat-area-42876 That error is not bubbling up to a failure, for sure. I was only able to detect it by looking at the propeller logs to see if something was wrong, because the job was stuck in a Queued state.

flat-area-42876 (07/11/2024, 8:11 PM):
> Because the job was stuck in queued state.
When a task fails with `failed to read data from dataDir...`, did the workflow eventually fail and stop getting evaluated by propeller after running out of system retries? By "stuck in queued state", are you referring to the state shown in the UI?
Also, for clarification: are the tasks that are failing with `failed to read data from dataDir...` part of the ArrayNode, or a task under another node in the same workflow that you're seeing ArrayNode subtasks not getting their phases updated correctly?

fierce-oil-47448 (07/12/2024, 6:08 AM):
> did the workflow eventually fail and stop getting evaluated by propeller after running out of system retries?
Not really. The node array got stuck in Queued state in one case, in the UI. In another case, the tracking of the individual tasks under the node array also stopped in the UI.
> Also for clarification, are the tasks that are failing with `failed to read data from dataDir...` part of the ArrayNode or a task under another node in the same workflow that you're seeing ArrayNode subtasks not getting their phases updated correctly?
I don't see the tasks failing; I only get that error in the propeller logs. I never see the tasks as failing in the UI, and I think that's the bigger problem. My workflow has only a single maptask and there are no nested tasks. It is workflow -> maptask.
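
[Editor's note] For context on the "workflow -> maptask" shape described above, a minimal flytekit sketch of that structure (the task/workflow names and types here are made up, not the actual ml_platform workflow):

```python
from typing import List

from flytekit import map_task, task, workflow

@task
def process_shard(shard_uri: str) -> str:
    # Placeholder body; the real task processes a single dataset shard.
    return shard_uri.upper()

@workflow
def data_lake_processing(shard_uris: List[str]) -> List[str]:
    # A single map task directly under the workflow: in recent flytekit this
    # is executed as an ArrayNode, the node whose subtask phases were not
    # updating in the UI in this thread.
    return map_task(process_shard)(shard_uri=shard_uris)
```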
flat-area-42876 (07/12/2024, 6:57 AM):
A small follow up: was there overlap between the workflow executions that had the ArrayNode display in the UI as SUCCEEDED with subtask phases as RUNNING, and the workflow executions that had `failed to read data from dataDir...` errors?
> The node array got stuck in Queued state in one case, in the UI
Did the workflow reach a terminal state in this case?
> the tracking of the individual tasks under the node array also stopped in the UI
Did the ArrayNode reach a terminal state? Also, did the workflow reach a terminal state in this case?

fierce-oil-47448 (07/12/2024, 5:18 PM):
> A small follow up: was there overlap between the workflow executions that had the ArrayNode display in the UI as SUCCEEDED with subtask phases as RUNNING, and the workflow executions that had `failed to read data from dataDir...` errors?
Great question. No, I don't see those errors. However, there are a number of failures, such as:
fierce-oil-47448 (07/12/2024, 5:19 PM)

fierce-oil-47448 (07/12/2024, 5:21 PM)

fierce-oil-47448 (07/16/2024, 7:23 AM)

fierce-oil-47448 (07/16/2024, 7:24 AM)

flat-area-42876 (07/16/2024, 5:27 PM):
> After almost 10 days, I can still see this job with that state and I have full propeller logs (the screenshots shared above).
This would be expected. Since propeller is no longer evaluating this workflow / the ArrayNode, there would not be any new events emitted to admin from propeller to update that state.
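
[Editor's note] Since the UI will keep showing whatever admin last recorded, a hedged sketch of checking admin's own view of a stuck execution directly with flytekit's FlyteRemote (the project/domain/execution name below are taken from the log snippet earlier in the thread and stand in as placeholders):

```python
from flytekit.configuration import Config
from flytekit.remote import FlyteRemote

# Connects using whatever Flyte config is active locally (e.g. FLYTECTL_CONFIG / ~/.flyte).
remote = FlyteRemote(Config.auto())

# Placeholder identifiers; substitute the stuck execution's real ids.
execution = remote.fetch_execution(
    project="data-processing",
    domain="production",
    name="f371ad7b01e3424a974",
)
execution = remote.sync_execution(execution, sync_nodes=True)

print("terminal?", execution.is_done)
print("workflow phase (enum int):", execution.closure.phase)
for node_id, node_exec in execution.node_executions.items():
    # Node phases as admin has recorded them, independent of the UI.
    print(node_id, node_exec.closure.phase)
```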
fierce-oil-47448 (07/22/2024, 6:57 PM)

flat-area-42876 (07/22/2024, 6:58 PM)

flat-area-42876 (08/20/2024, 3:22 AM)

fierce-oil-47448 (08/20/2024, 8:06 AM)

flat-area-42876 (08/27/2024, 1:04 AM)