# ask-the-community
m
I have already spoken with our friend Glime (https://flyte-org.slack.com/archives/C06H1SFA19R/p1707490200283569), but I got stuck. This is the error I am getting:
```
Workflow[redacted] failed. RuntimeExecutionError: max number of system retry attempts [11/10] exhausted. Last known status message: failed at Node[dn0]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [k8s-array]:
Operation cannot be fulfilled on pods "a4cnhw27dj827cc5mzdv-n0-0-n1-0-dn0-0-70": the object has been modified; please apply your changes to the latest version and try again
```
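As far as I understand, the 409 itself is just Kubernetes' optimistic concurrency: every write carries the object's current `resourceVersion`, and a write made against a stale copy is rejected with "the object has been modified". A minimal sketch of the generic read-modify-write-retry pattern (only an illustration of the API behaviour, not Flyte's actual code; the function name is made up):

```python
# Generic "retry on 409 Conflict" pattern against the Kubernetes API.
# Illustration of the optimistic-concurrency behaviour behind the error above,
# not what Flyte does internally.
from kubernetes import client, config
from kubernetes.client.rest import ApiException

config.load_kube_config()
v1 = client.CoreV1Api()


def remove_finalizers_with_retry(name: str, namespace: str, attempts: int = 5) -> None:
    for _ in range(attempts):
        try:
            # Re-read to pick up the latest resourceVersion before writing.
            pod = v1.read_namespaced_pod(name=name, namespace=namespace)
            pod.metadata.finalizers = None
            v1.replace_namespaced_pod(name=name, namespace=namespace, body=pod)
            return
        except ApiException as e:
            if e.status != 409:  # 409 = "the object has been modified"
                raise
    raise RuntimeError("gave up after repeated 409 conflicts")
```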
This is roughly the code I am executing:
```python
import functools
from datetime import datetime, timedelta

from flytekit import dynamic, map_task, TaskMetadata
from pandas import DataFrame  # assuming pandas DataFrames here

# my_query (a @task) and AVAILABLE_RESOURCES (Resources presets) are defined elsewhere.


@dynamic
def get_data(
    items: list[str],
    param1: str,
    param2: str,
    param3: datetime,
    param4: str,
) -> list[DataFrame]:
    # Bind the constant parameters so map_task only maps over `items`.
    partial_query = functools.partial(
        my_query,
        param1=param1,
        param2=param2,
        param3=param3,
        param4=param4,
    )
    my_data = map_task(
        partial_query,
        metadata=TaskMetadata(retries=0, timeout=timedelta(minutes=240)),
        concurrency=10,
    )(my_item=items).with_overrides(requests=AVAILABLE_RESOURCES["s"])
    return my_data
```
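For completeness, the pieces the snippet references look roughly like this (an illustrative sketch only; the resource sizes and the task body are placeholders, not my real definitions):

```python
# Illustrative sketch of the pieces referenced above.
from datetime import datetime

from flytekit import Resources, task
from pandas import DataFrame

# Assumed shape of the preset referenced as AVAILABLE_RESOURCES["s"].
AVAILABLE_RESOURCES = {"s": Resources(cpu="500m", mem="1Gi")}


@task
def my_query(
    my_item: str,
    param1: str,
    param2: str,
    param3: datetime,
    param4: str,
) -> DataFrame:
    # Placeholder body; the real task mostly waits on an external service
    # and returns the response as a DataFrame.
    return DataFrame({"item": [my_item]})
```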
So basically a `map_task` that iterates over a list of items to get some data. The `my_query` function is not very demanding; most of the time is spent waiting for a response from an external service. There are no retries configured. My k8s cluster seems to be in good shape: monitored metrics look fine, there are plenty of resources, and there are no disturbing events. The set to iterate over can vary from a few dozen to a few thousand items (not more than 10000). It crashes every time; 357 processed items was the best result. I have checked the `flyte-binary` logs (I am deploying to an on-premise cluster), and the only error I saw was:
```
"Failed to cache Dynamic workflow [[CACHE_WRITE_FAILED] Failed to Cache the metadata, caused by: The entry size is larger than 1/1024 of cache size]",
```
...which does not seem related. Then, immediately after, messages like this:
```
"Dynamic handler.Handle's called with phase 0.",
"Node level caching is disabled. Skipping catalog read.",
"Failed to clear finalizers for Resource with name: cmf-forecasts-production/aqhkngdswq792rsblbx7-n0-0-n1-0-dn0-0-357. Error: Operation cannot be fulfilled on pods \"aqhkngdswq792rsblbx7-n0-0-n1-0-dn0-0-357\": the object has been modified; please apply your changes to the latest version and try again",
```
I would appreciate any advice. I am running Flyte 1.9.1.
c
any chance you have multiple instances of `flyte-binary` running in the same cluster, even in different namespaces?
i saw `the object has been modified; please apply your changes to the latest version and try again` when trying that setup
k
ohh, ya it must be that some other process is updating the CRD. This could be multiple propellers. You can enable leader-election
m
Hmm, I don't think so 🙂. I deploy Flyte with Helm and can list only a single release:
```
flyte-binary         	ml               	30      	2024-02-07 12:11:43.445627307 +0000 UTC	deployed	flyte-binary-v1.9.1         	1.16.0
```
There are some old ReplicaSets, but they don't have any pods:
```
$ kubectl get replicasets -A | grep flyte
ml                  flyte-binary-78bcfb6f48                                 0         0         0       102d
ml                  flyte-binary-555cf6fcdc                                 0         0         0       136d
ml                  flyte-binary-85dbb66fd7                                 1         1         1       4d6h
```
And only one running pod across all namespaces:
```
$ kubectl get pod -A | grep flyte
ml                         flyte-binary-85dbb66fd7-m8wgn                                 1/1     Running     0               4d6h
```
I can also see only one flyte process in the process list:
```
root     2659727  1.1  1.1 4119896 390080 ?      Sl   Feb05  73:54 /usr/local/bin/flyte start --config /etc/flyte/config.d/*.yaml
```
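If it helps, here is a quick cluster-wide double-check that looks at the container command rather than the pod name, so it would also catch a propeller deployed under a different name (just a sketch):

```python
# Sketch: find pods in any namespace whose containers run the flyte binary.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

for pod in v1.list_pod_for_all_namespaces().items:
    for container in pod.spec.containers:
        command = " ".join(container.command or [])
        if "flyte start" in command or "flytepropeller" in command:
            print(pod.metadata.namespace, pod.metadata.name, command)
```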
I have separate clusters for dev/staging/prod 🙂.
I did a little more digging, enabled `RequestResponse` auditing on my k8s cluster, and followed the whole lifecycle of the failed task. Basically it looks like this:
• Flyte: `create` pod
• kubelite: `patch` with `ContainerCreating`
• Go-http-client/2.0: `patch` with CNI annotations (networking)
• kubelite: `patch` with `Running`
• kubelite: `patch` with `Completed`
• Go-http-client/2.0: `patch` CNI annotations
• kubelite: `patch` with `Succeeded`
• kubelite: `delete`
• Flyte: `Update` <-- this causes 409
• Flyte: `Update` workflow (...)
The resource manifest that Flyte is trying to apply contains state that looks like it did after step 4: it has `Running` status, annotations about assigned IPs, and finalizers. Basically, Flyte seems unaware that the pod completed a few seconds ago. I am at a loss here; if you have any advice, any at all, I will be much obliged.
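In case someone wants to reproduce the observation without enabling audit logs, the same lifecycle can be followed with a plain watch on the namespace; this also shows the `resourceVersion` that a stale `Update` would be rejected against (a sketch, namespace taken from my logs):

```python
# Sketch: watch pod lifecycle events instead of reading the audit log.
from kubernetes import client, config, watch

config.load_kube_config()
v1 = client.CoreV1Api()

w = watch.Watch()
for event in w.stream(v1.list_namespaced_pod, namespace="cmf-forecasts-production"):
    pod = event["object"]
    print(
        event["type"],                  # ADDED / MODIFIED / DELETED
        pod.metadata.name,
        pod.status.phase,
        pod.metadata.resource_version,  # version a conflicting Update is checked against
        pod.metadata.finalizers,
    )
```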