# flyte-support
c
Not sure if a regression in 1.14.0, but we have some tasks that seem to be stuck in a `Queued` state on the UI. Looking at the `flytepropeller` logs, all I see is this repeated for over a day. One thing I noticed is that I never see an `acquired cache reservation` log for this execution, but I do see it for others.
> 2024-12-16 19:42:32.587 {"json":{"exec_id":"f1886719e6ccc600f000","ns":"metrics-development","routine":"worker-72"},"level":"info","msg":"Processing Workflow.","ts":"2024-12-17T03:42:32Z"}
> 2024-12-16 19:42:32.588 {"json":{"exec_id":"f1886719e6ccc600f000","ns":"metrics-development","res_ver":"190011148","routine":"worker-72","wf":"redacteddbt_streaming_sync"},"level":"info","msg":"Handling Workflow [f1886719e6ccc600f000], id: [project:\"metrics\" domain:\"development\" name:\"f1886719e6ccc600f000\"], p [Running]","ts":"2024-12-17T03:42:32Z"}
> 2024-12-16 19:42:32.591 {"json":{"exec_id":"f1886719e6ccc600f000","node":"n0","ns":"metrics-development","res_ver":"190011148","routine":"worker-72","wf":"redacteddbt_streaming_sync"},"level":"info","msg":"Catalog CacheMiss: Artifact not found in Catalog. Executing Task.","ts":"2024-12-17T03:42:32Z"}
> 2024-12-16 19:42:32.604 {"json":{"exec_id":"f1886719e6ccc600f000","ns":"metrics-development","res_ver":"190011148","routine":"worker-72","wf":"redacteddbt_streaming_sync"},"level":"info","msg":"Handling Workflow [f1886719e6ccc600f000] Done","ts":"2024-12-17T03:42:32Z"}
> 2024-12-16 19:42:32.612 {"json":{"exec_id":"f1886719e6ccc600f000","ns":"metrics-development","routine":"worker-72"},"level":"info","msg":"Will not fast follow, Reason: Wf terminated? false, Version matched? true","ts":"2024-12-17T03:42:32Z"}
> 2024-12-16 19:42:32.612 {"json":{"exec_id":"f1886719e6ccc600f000","ns":"metrics-development","routine":"worker-72"},"level":"info","msg":"Streak ended at [0]/Max: [8]","ts":"2024-12-17T03:42:32Z"}
> 2024-12-16 19:42:32.612 {"json":{"exec_id":"f1886719e6ccc600f000","ns":"metrics-development","routine":"worker-72"},"level":"info","msg":"Completed processing workflow.","ts":"2024-12-17T03:42:32Z"}
> 2024-12-16 19:42:32.612 {"json":{"exec_id":"f1886719e6ccc600f000","ns":"metrics-development","routine":"worker-72"},"level":"info","msg":"Successfully synced 'metrics-development/f1886719e6ccc600f000'","ts":"2024-12-17T03:42:32Z"}
> 2024-12-16 19:42:33.584 {"json":{},"level":"info","msg":"==> Enqueueing workflow [metrics-development/f1886719e6ccc600f000]","ts":"2024-12-17T03:42:33Z"}
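For anyone triaging something similar, here is a minimal sketch of one way to filter those structured propeller log lines for a single `exec_id` and check whether an `acquired cache reservation` message ever shows up. The struct fields mirror the log lines above; the program and its name are just an illustration, not a Flyte tool.
```go
// filterlogs.go - hypothetical helper: read flytepropeller JSON log lines from
// stdin, keep only those for a given exec_id, and report whether an
// "acquired cache reservation" message was ever seen for it.
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"os"
	"strings"
)

// logLine matches the fields visible in the propeller log output above.
type logLine struct {
	JSON struct {
		ExecID string `json:"exec_id"`
	} `json:"json"`
	Msg string `json:"msg"`
	TS  string `json:"ts"`
}

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: filterlogs <exec_id>")
		os.Exit(1)
	}
	execID := os.Args[1]
	sawReservation := false

	sc := bufio.NewScanner(os.Stdin)
	sc.Buffer(make([]byte, 0, 1024*1024), 1024*1024) // log lines can be long

	for sc.Scan() {
		raw := sc.Text()
		// Lines may carry a local timestamp prefix; parse from the first '{'.
		start := strings.Index(raw, "{")
		if start < 0 {
			continue
		}
		var ll logLine
		if err := json.Unmarshal([]byte(raw[start:]), &ll); err != nil {
			continue
		}
		if ll.JSON.ExecID != execID {
			continue
		}
		fmt.Printf("%s %s\n", ll.TS, ll.Msg)
		if strings.Contains(strings.ToLower(ll.Msg), "acquired cache reservation") {
			sawReservation = true
		}
	}
	fmt.Printf("saw 'acquired cache reservation' for %s: %v\n", execID, sawReservation)
}
```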
Looking at the cache logic, the key is defined as:
```go
return catalog.Key{
		Identifier:           *taskTemplate.Id, //nolint:protogetter
		CacheVersion:         taskTemplate.GetMetadata().GetDiscoveryVersion(),
		CacheIgnoreInputVars: taskTemplate.GetMetadata().GetCacheIgnoreInputVars(),
		TypedInterface:       *taskTemplate.GetInterface(),
		InputReader:          nCtx.InputReader(),
	}, nil
```
and this is a task running on a launch plan/cron schedule, so I'm pretty sure the cache key is consistent across runs... which means something must still be holding onto the reservation.
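To make "something is holding onto the reservation" concrete, here is a minimal, self-contained sketch of the reservation ownership model. The types and the in-memory store are hypothetical, not the real datacatalog client or its API: one owner acquires a reservation on a cache key and extends it while working; anyone else asking for the same key is told who owns it and stays queued until the reservation is released or its heartbeat window expires.
```go
// reservation_sketch.go - hypothetical, in-memory illustration of a cache
// reservation: whoever owns the reservation for a cache key is expected to
// execute the task; everyone else stays queued and re-checks on each round.
package main

import (
	"fmt"
	"sync"
	"time"
)

type reservation struct {
	ownerID   string
	expiresAt time.Time
}

type reservationStore struct {
	mu           sync.Mutex
	heartbeat    time.Duration
	reservations map[string]reservation // cache key -> current reservation
}

// getOrExtend grants the reservation if it is free, expired, or already owned
// by the caller; otherwise it returns the existing owner and acquired=false.
func (s *reservationStore) getOrExtend(key, ownerID string, now time.Time) (owner string, acquired bool) {
	s.mu.Lock()
	defer s.mu.Unlock()

	cur, ok := s.reservations[key]
	if !ok || now.After(cur.expiresAt) || cur.ownerID == ownerID {
		s.reservations[key] = reservation{ownerID: ownerID, expiresAt: now.Add(s.heartbeat)}
		return ownerID, true
	}
	return cur.ownerID, false
}

// release frees the reservation, but only if the caller still owns it.
func (s *reservationStore) release(key, ownerID string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if cur, ok := s.reservations[key]; ok && cur.ownerID == ownerID {
		delete(s.reservations, key)
	}
}

func main() {
	store := &reservationStore{
		heartbeat:    3 * time.Minute,
		reservations: map[string]reservation{},
	}
	key := "metrics:development:dbt_streaming_sync:cache-v1" // illustrative cache key

	now := time.Now()
	fmt.Println(store.getOrExtend(key, "exec-A-n0", now)) // exec-A-n0 true  -> A runs the task

	// Execution B evaluates the same cached task while A holds the reservation:
	fmt.Println(store.getOrExtend(key, "exec-B-n0", now.Add(time.Minute))) // exec-A-n0 false -> B stays queued

	// If A never releases (e.g. its state was lost mid-flight), B only gets the
	// reservation once A's heartbeat window has expired:
	fmt.Println(store.getOrExtend(key, "exec-B-n0", now.Add(5*time.Minute))) // exec-B-n0 true
}
```
If the owner's bookkeeping is lost before release, everyone else waits until the expiry path kicks in, which is roughly the failure mode being guessed at in this thread.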
We had a transient auth issue between flytepropeller and flyteadmin, and I'm wondering if some state got lost/messed up along the way:
```
Workflow[metrics:.....dbt_streaming_sync] failed. RuntimeExecutionError: max number of system retry attempts [77/30] exhausted. Last known status message: Workflow[] failed. ErrorRecordingError: failed to publish event, caused by: EventSinkError: Error sending event, caused by [rpc error: code = Unknown desc = failed database operation with server login has been failing, try again later (server_login_retry)]
```
That is probably the issue
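As an aside on the `[77/30]` in that message: a plausible reading (an assumption, not confirmed against the propeller source) is that the system-retry counter kept climbing past the configured max of 30 because the terminal failure event itself could not be recorded, so every evaluation round bumped the counter and tried again. A tiny, hypothetical sketch of that feedback loop:
```go
// retry_sketch.go - hypothetical illustration of why a system-retry counter can
// end up well past its configured max (e.g. [77/30]) when recording the failure
// event itself keeps failing. Not flytepropeller code, just the shape of the loop.
package main

import (
	"errors"
	"fmt"
)

const maxSystemRetries = 30

// recordFailureEvent stands in for publishing the terminal event to admin;
// it fails for a while, the way an auth outage would make it fail.
func recordFailureEvent(systemFailures int) error {
	if systemFailures < 77 {
		return errors.New("EventSinkError: server_login_retry")
	}
	return nil
}

func main() {
	systemFailures := 0
	for {
		systemFailures++
		if systemFailures < maxSystemRetries {
			continue // normally: re-evaluate the workflow and hope the node recovers
		}
		// Past the max: try to mark the workflow failed. If that event can't be
		// recorded, the terminal state never sticks, the workflow gets evaluated
		// again, and the counter keeps growing past the max.
		if err := recordFailureEvent(systemFailures); err != nil {
			fmt.Printf("[%d/%d] exhausted, but recording the failure failed: %v\n",
				systemFailures, maxSystemRetries, err)
			continue
		}
		fmt.Printf("workflow finally marked failed at [%d/%d]\n",
			systemFailures, maxSystemRetries)
		return
	}
}
```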
f
Is the task completed (UI not showing correct state) or is it stuck in QUEUED? Is the workflow still getting evaluated?
c
It was stuck in QUEUED and still getting evaluated. Pretty sure it was some issue with the cache, but since I saw the temporary auth issues I just aborted the workflows. The auth issues were due to competing secrets, since we didn't explicitly set `clientSecret` to `null` in the Helm chart.
Probably not worth investigating further
f
Sounds good. Maybe there was a similar auth issue with propeller trying to connect to datacatalog, and then propeller wasn't able to update the state that admin had already received because of the auth issue.