# ask-the-community
m
I have already spoken with our friend Glime (https://flyte-org.slack.com/archives/C06H1SFA19R/p1707490200283569), but I got stuck. This is the error I am getting:
```
Workflow[redacted] failed. RuntimeExecutionError: max number of system retry attempts [11/10] exhausted. Last known status message: failed at Node[dn0]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [k8s-array]:
Operation cannot be fulfilled on pods "a4cnhw27dj827cc5mzdv-n0-0-n1-0-dn0-0-70": the object has been modified; please apply your changes to the latest version and try again
```
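As far as I understand, the 409 itself is just Kubernetes' optimistic concurrency: every write carries the object's current `resourceVersion`, and a write made against a stale copy is rejected with "the object has been modified". A minimal sketch of the generic read-modify-write-retry pattern (only an illustration of the API behaviour, not Flyte's actual code; the function name is made up):

```python
# Generic "retry on 409 Conflict" pattern against the Kubernetes API.
# Illustration of the optimistic-concurrency behaviour behind the error above,
# not what Flyte does internally.
from kubernetes import client, config
from kubernetes.client.rest import ApiException

config.load_kube_config()
v1 = client.CoreV1Api()


def remove_finalizers_with_retry(name: str, namespace: str, attempts: int = 5) -> None:
    for _ in range(attempts):
        try:
            # Re-read to pick up the latest resourceVersion before writing.
            pod = v1.read_namespaced_pod(name=name, namespace=namespace)
            pod.metadata.finalizers = None
            v1.replace_namespaced_pod(name=name, namespace=namespace, body=pod)
            return
        except ApiException as e:
            if e.status != 409:  # 409 = "the object has been modified"
                raise
    raise RuntimeError("gave up after repeated 409 conflicts")
```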
This is roughly the code I am executing:
```python
import functools
from datetime import datetime, timedelta

from flytekit import dynamic, map_task, TaskMetadata
from pandas import DataFrame  # assuming pandas DataFrames here

# my_query (a @task) and AVAILABLE_RESOURCES (Resources presets) are defined elsewhere.


@dynamic
def get_data(
    items: list[str],
    param1: str,
    param2: str,
    param3: datetime,
    param4: str,
) -> list[DataFrame]:
    # Bind the constant parameters so map_task only maps over `items`.
    partial_query = functools.partial(
        my_query,
        param1=param1,
        param2=param2,
        param3=param3,
        param4=param4,
    )
    my_data = map_task(
        partial_query,
        metadata=TaskMetadata(retries=0, timeout=timedelta(minutes=240)),
        concurrency=10,
    )(my_item=items).with_overrides(requests=AVAILABLE_RESOURCES["s"])
    return my_data
```
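For completeness, the pieces the snippet references look roughly like this (an illustrative sketch only; the resource sizes and the task body are placeholders, not my real definitions):

```python
# Illustrative sketch of the pieces referenced above.
from datetime import datetime

from flytekit import Resources, task
from pandas import DataFrame

# Assumed shape of the preset referenced as AVAILABLE_RESOURCES["s"].
AVAILABLE_RESOURCES = {"s": Resources(cpu="500m", mem="1Gi")}


@task
def my_query(
    my_item: str,
    param1: str,
    param2: str,
    param3: datetime,
    param4: str,
) -> DataFrame:
    # Placeholder body; the real task mostly waits on an external service
    # and returns the response as a DataFrame.
    return DataFrame({"item": [my_item]})
```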
So basically a `map_task` that iterates over a list of items to get some data. The `my_query` function is not very demanding; most of the time is spent waiting for a response from an external service. There are no retries configured. My k8s cluster seems to be in good shape: monitored metrics look fine, there are plenty of resources, and there are no disturbing events. The set to iterate over can vary from a few dozen to a few thousand items (not more than 10000). It crashes every time; 357 processed items was the best result. I have checked the `flyte-binary` logs (I am deploying to an on-premise cluster), and the only error I saw was:
```
"Failed to cache Dynamic workflow [[CACHE_WRITE_FAILED] Failed to Cache the metadata, caused by: The entry size is larger than 1/1024 of cache size]",
```
...which does not seem related. Then, immediately after, messages like this:
```
"Dynamic handler.Handle's called with phase 0.",
"Node level caching is disabled. Skipping catalog read.",
"Failed to clear finalizers for Resource with name: cmf-forecasts-production/aqhkngdswq792rsblbx7-n0-0-n1-0-dn0-0-357. Error: Operation cannot be fulfilled on pods \"aqhkngdswq792rsblbx7-n0-0-n1-0-dn0-0-357\": the object has been modified; please apply your changes to the latest version and try again",
```
I would appreciate any advice. I am running Flyte 1.9.1.
c
any chance you have multiple instances of `flyte-binary` running in the same cluster, even in different namespaces?
i saw `the object has been modified; please apply your changes to the latest version and try again` when trying that setup
k
ohh, ya it must be that some other process is updating the CRD. This could be multiple propellers. You can enable leader-election
m
Hmm, I don't think so 🙂. I deploy Flyte with Helm and can list only a single release:
```
flyte-binary         	ml               	30      	2024-02-07 12:11:43.445627307 +0000 UTC	deployed	flyte-binary-v1.9.1         	1.16.0
```
There are some old ReplicaSets, but they don't have any pods:
```
$ kubectl get replicasets -A | grep flyte
ml                  flyte-binary-78bcfb6f48                                 0         0         0       102d
ml                  flyte-binary-555cf6fcdc                                 0         0         0       136d
ml                  flyte-binary-85dbb66fd7                                 1         1         1       4d6h
```
And only one running pod across all namespaces:
```
$ kubectl get pod -A | grep flyte
ml                         flyte-binary-85dbb66fd7-m8wgn                                 1/1     Running     0               4d6h
```
I can also see only one flyte process in the process list:
```
root     2659727  1.1  1.1 4119896 390080 ?      Sl   Feb05  73:54 /usr/local/bin/flyte start --config /etc/flyte/config.d/*.yaml
```
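If it helps, here is a quick cluster-wide double-check that looks at the container command rather than the pod name, so it would also catch a propeller deployed under a different name (just a sketch):

```python
# Sketch: find pods in any namespace whose containers run the flyte binary.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

for pod in v1.list_pod_for_all_namespaces().items:
    for container in pod.spec.containers:
        command = " ".join(container.command or [])
        if "flyte start" in command or "flytepropeller" in command:
            print(pod.metadata.namespace, pod.metadata.name, command)
```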
I have separate clusters for dev/staging/prod 🙂.
I did a little more digging, enabled `RequestResponse` auditing on my k8s cluster, and followed the whole lifecycle of the failed task. Basically it looks like this:
• Flyte: `create` pod
• kubelite: `patch` with `ContainerCreating`
• Go-http-client/2.0: `patch` with CNI annotations (networking)
• kubelite: `patch` with `Running`
• kubelite: `patch` with `Completed`
• Go-http-client/2.0: `patch` CNI annotations
• kubelite: `patch` with `Succeeded`
• kubelite: `delete`
• Flyte: `Update` <-- this causes 409
• Flyte: `Update` workflow (...)
The resource manifest that Flyte is trying to apply contains state that looks like it did after step 4: it has `Running` status, annotations about assigned IPs, and finalizers. Basically, Flyte seems unaware that the pod completed a few seconds ago. I am at a loss here; if you have any advice, any at all, I will be much obliged.
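In case someone wants to reproduce the observation without enabling audit logs, the same lifecycle can be followed with a plain watch on the namespace; this also shows the `resourceVersion` that a stale `Update` would be rejected against (a sketch, namespace taken from my logs):

```python
# Sketch: watch pod lifecycle events instead of reading the audit log.
from kubernetes import client, config, watch

config.load_kube_config()
v1 = client.CoreV1Api()

w = watch.Watch()
for event in w.stream(v1.list_namespaced_pod, namespace="cmf-forecasts-production"):
    pod = event["object"]
    print(
        event["type"],                  # ADDED / MODIFIED / DELETED
        pod.metadata.name,
        pod.status.phase,
        pod.metadata.resource_version,  # version a conflicting Update is checked against
        pod.metadata.finalizers,
    )
```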