Maciej Kopczyński
02/09/2024, 6:47 PM
Workflow [redacted] failed. RuntimeExecutionError: max number of system retry attempts [11/10] exhausted. Last known status message: failed at Node[dn0]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [k8s-array]:
Operation cannot be fulfilled on pods "a4cnhw27dj827cc5mzdv-n0-0-n1-0-dn0-0-70": the object has been modified; please apply your changes to the latest version and try again
This is roughly the code I am executing:
import functools
from datetime import datetime, timedelta

from flytekit import TaskMetadata, dynamic, map_task
from pandas import DataFrame  # assumption: DataFrame here is pandas


@dynamic
def get_data(
    items: list[str],
    param1: str,
    param2: str,
    param3: datetime,
    param4: str,
) -> list[DataFrame]:
    partial_query = functools.partial(
        my_query,
        param1=param1,
        param2=param2,
        param3=param3,
        param4=param4,
    )
    my_data = map_task(
        partial_query,
        metadata=TaskMetadata(retries=0, timeout=timedelta(minutes=240)),
        concurrency=10,
    )(my_item=items).with_overrides(requests=AVAILABLE_RESOURCES["s"])
    return my_data
So basically a map_task that iterates over a list of items to get some data. The my_query function is not very demanding; most of the time is spent waiting for a response from an external service. There are no retries configured. My k8s cluster seems to be in good shape: monitored metrics look fine, there are plenty of resources, and there are no disturbing events. The set to iterate over can vary from a few dozen to a few thousand items (not more than 10000). It crashes every time; 357 processed items was the best result.
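For illustration only, a minimal sketch of what a task like my_query could look like, assuming it is a plain flytekit @task that returns a pandas DataFrame and mostly blocks on an HTTP call; the endpoint, payload, and parameter handling below are placeholders, not the real code:

from datetime import datetime

import pandas as pd
import requests
from flytekit import task


@task
def my_query(
    my_item: str,
    param1: str,
    param2: str,
    param3: datetime,
    param4: str,
) -> pd.DataFrame:
    # Most of the task's runtime is spent blocked here waiting for the
    # external service to answer; CPU and memory needs are small.
    resp = requests.get(
        "https://example.internal/api/data",  # hypothetical endpoint
        params={
            "item": my_item,
            "p1": param1,
            "p2": param2,
            "p3": param3.isoformat(),
            "p4": param4,
        },
        timeout=600,
    )
    resp.raise_for_status()
    return pd.DataFrame(resp.json())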
I have checked the flyte-binary logs (I am deploying to an on-premise cluster); the only error I saw was:
"Failed to cache Dynamic workflow [[CACHE_WRITE_FAILED] Failed to Cache the metadata, caused by: The entry size is larger than 1/1024 of cache size]",
...which does not seem related. Then, immediately after, there are messages like this:
"Dynamic handler.Handle's called with phase 0.",
"Node level caching is disabled. Skipping catalog read.",
"Failed to clear finalizers for Resource with name: cmf-forecasts-production/aqhkngdswq792rsblbx7-n0-0-n1-0-dn0-0-357. Error: Operation cannot be fulfilled on pods \"aqhkngdswq792rsblbx7-n0-0-n1-0-dn0-0-357\": the object has been modified; please apply your changes to the latest version and try again",
I would appreciate any advice. I am running Flyte 1.9.1.
Chris Grass
02/09/2024, 6:52 PM
Is there any chance you have more than one flyte-binary running in the same cluster, even in different namespaces?
Chris Grass
02/09/2024, 6:52 PM
We have seen "the object has been modified; please apply your changes to the latest version and try again" when trying that setup.
Ketan (kumare3)
Maciej Kopczyński
02/09/2024, 8:25 PM
NAME          NAMESPACE  REVISION  UPDATED                                  STATUS    CHART                APP VERSION
flyte-binary  ml         30        2024-02-07 12:11:43.445627307 +0000 UTC  deployed  flyte-binary-v1.9.1  1.16.0
There are some old replicasets but without any pods:
$ kubectl get replicasets -A | grep flyte
ml flyte-binary-78bcfb6f48 0 0 0 102d
ml flyte-binary-555cf6fcdc 0 0 0 136d
ml flyte-binary-85dbb66fd7 1 1 1 4d6h
And only one running pod across all namespaces:
$ kubectl get pod -A | grep flyte
ml flyte-binary-85dbb66fd7-m8wgn 1/1 Running 0 4d6h
I can also see only one flyte process on the process list:
root 2659727 1.1 1.1 4119896 390080 ? Sl Feb05 73:54 /usr/local/bin/flyte start --config /etc/flyte/config.d/*.yaml
I have separate clusters for dev/staging/prod 🙂.
Maciej Kopczyński
02/12/2024, 9:30 PM
I have enabled RequestResponse auditing on my k8s cluster and followed through the whole cycle of the failed task. Basically it looks like this:
• Flyte: create pod
• kubelite: patch with ContainerCreating
• Go-http-client/2.0: patch with CNI annotations (networking)
• kubelite: patch with Running
• kubelite: patch with Completed
• Go-http-client/2.0: patch CNI annotations
• kubelite: patch with Succeeded
• kubelite: delete
• Flyte: Update <-- this causes 409
• Flyte: Update workflow
(...)
The resource manifest that Flyte is trying to apply contains state which looks like it did after step 4: it has Running status, the annotation about assigned IPs, and finalizers. Basically, Flyte seems unaware that the pod completed a few seconds ago.
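For reference, a rough sketch of how such a trace can be pulled out of the audit log, assuming the standard audit.k8s.io/v1 JSON-lines format; the log path here is just an example:

import json

AUDIT_LOG = "/var/log/kube-apiserver/audit.log"  # example path, depends on the cluster
POD_NAME = "aqhkngdswq792rsblbx7-n0-0-n1-0-dn0-0-357"

with open(AUDIT_LOG) as f:
    for line in f:
        event = json.loads(line)
        ref = event.get("objectRef", {})
        if ref.get("resource") != "pods" or ref.get("name") != POD_NAME:
            continue
        # Only completed requests, so each API call shows up once.
        if event.get("stage") != "ResponseComplete":
            continue
        print(
            event.get("requestReceivedTimestamp"),
            event.get("userAgent"),
            event.get("verb"),
            event.get("responseStatus", {}).get("code"),
        )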
I am at a loss here; if you have any advice, any at all, I will be much obliged.