# flyte-support
w
Perhaps related: I'm testing fan-out using `map_task` with progressively larger batches. In the graphs below, the first ramp is ~1,000 elements, the second is ~10,000, and the current one is ~20,000. The task is straightforward: it sleeps between 0.25 and 0.5 seconds and returns a dataclass. The memory usage seems disproportionate to the job.
Currently running the `cr.flyte.org/flyteorg/flyte-binary-release:v1.13.0` image via the `flyte-binary-v1.13.0` chart.
```python
@dataclass
class Result(DataClassJSONMixin):
    ok: bool
    msg: str
    meta: dict[str, str | int | bool]


@task
def do_work(id: int) -> Optional[Result]:

    sleep(random.uniform(0.25, 0.5))
    if id % 23 == 0:
        # simulated error
        return None
    return Result(
        ok=True, msg=f"Hello, {id}", meta={"foo": "bar", "baz": id, "qux": True}
    )

@workflow
def do_flat_fanout(n: int = 1000) -> float:
    ids = generate_ids(n=n)
    results = map_task(do_work, concurrency=50)(id=ids)
    fr = filter_results(results=results)
    return reduce_results(results=fr)
```
(Imports and the filter/reduce tasks are elided; a sketch of them follows the launch command.) Launched with:
```shell
pyflyte run --remote odyssey_data/flyte/examples/fanout.py do_flat_fanout --n 20037
```
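For reference, a minimal sketch of what the elided helpers could look like, inferred from the workflow signatures above (the bodies here are assumptions, not the code actually run):
```python
from typing import Optional

from flytekit import task

# Result is the dataclass defined earlier in this thread.


@task
def generate_ids(n: int) -> list[int]:
    # Produce the fan-out inputs for the map_task.
    return list(range(n))


@task
def filter_results(results: list[Optional[Result]]) -> list[Result]:
    # Drop the simulated failures (None entries) before reducing.
    return [r for r in results if r is not None]


@task
def reduce_results(results: list[Result]) -> float:
    # Example reduction: count of successful results.
    return float(sum(1 for r in results if r.ok))
```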
a
@wide-lion-54536 could you take a look at the flyteadmin logs? There's an issue under investigation on 1.13 that drives OOM on the backend pods; it can be detected as hundreds of calls to a couple of gRPC methods
👀 1
w
I'm looking at all logs from the `flyte` namespace and don't see much during the time span in the graphs above until the pod restarted at 08:23.
```
Common labels: {"app":"flyte-binary","component":"flyte-binary","container":"flyte","filename":"/var/log/pods/flyte_flyte-binary-75855f5745-wjw9f_71d7974a-1eac-4af6-a8d6-fc9b4d09ab7c/flyte/0.log","instance":"flyte-binary","job":"flyte/flyte-binary","namespace":"flyte","node_name":"aks-cpularge-32799272-vmss000018","pod":"flyte-binary-75855f5745-wjw9f","stream":"stderr"}
Line limit: "2000 (7 displayed)"
Total bytes processed: "1.67  kB"


2024-08-16 07:01:23.963	W0816 14:01:23.963802       7 reflector.go:458] pkg/mod/k8s.io/client-go@v0.28.2/tools/cache/reflector.go:229: watch of *v1.Pod ended with: an error on the server ("unable to decode an event from the watch stream: got short buffer with n=0, base=4092, cap=40960") has prevented the request from succeeding
2024-08-16 07:03:56.197	W0816 14:03:56.196876       7 reflector.go:458] pkg/mod/k8s.io/client-go@v0.28.2/tools/cache/reflector.go:229: watch of *v1.Pod ended with: an error on the server ("unable to decode an event from the watch stream: got short buffer with n=0, base=4092, cap=20480") has prevented the request from succeeding
2024-08-16 07:44:58.647	W0816 14:44:58.646960       7 reflector.go:458] pkg/mod/k8s.io/client-go@v0.28.2/tools/cache/reflector.go:229: watch of *v1.Pod ended with: an error on the server ("unable to decode an event from the watch stream: got short buffer with n=0, base=4092, cap=20480") has prevented the request from succeeding
2024-08-16 08:14:30.538	W0816 15:14:30.538669       7 reflector.go:458] pkg/mod/k8s.io/client-go@v0.28.2/tools/cache/reflector.go:229: watch of *v1.Pod ended with: an error on the server ("unable to decode an event from the watch stream: got short buffer with n=0, base=4092, cap=20480") has prevented the request from succeeding
2024-08-16 08:23:27.170	E0816 15:23:27.170185       7 workers.go:103] error syncing 'flytesnacks-development/f675591f52ec2476baf4': worker error(s) encountered: [5]: 0: 
2024-08-16 08:23:27.170	Operation cannot be fulfilled on pods "f675591f52ec2476baf4-n1-0-n12071-0": the object has been modified; please apply your changes to the latest version and try again
2024-08-16 08:23:27.170
```
a
do you know if you have set `inject-finalizer: true`? We've seen this same error message (`the object has been modified; please apply your changes to the latest version and try again`) in situations of high concurrency. While it's still under investigation, disabling finalizers seems to help as a temporary measure
w
Yep, it's set: `inline.plugins.k8s.inject-finalizer: true`
a
could you try setting it to `false` and test again?
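A minimal sketch of that override as Helm values for the flyte-binary chart, assuming the setting lives under the chart's `configuration.inline` block as in the key quoted above (adjust to your own values layout):
```yaml
# Sketch: disable finalizer injection via the flyte-binary chart values.
# The configuration.inline path is an assumption based on the
# inline.plugins.k8s.inject-finalizer setting mentioned in this thread.
configuration:
  inline:
    plugins:
      k8s:
        inject-finalizer: false
```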
w
Will do, thanks for the quick help here! Curious, is `concurrency=50` considered high?
a
not at all
w
That completed successfully after disabling `inject-finalizer` (thanks!), but I'm still looking at what seems to be excessive memory pressure in flyte-binary, peaking at 19.5GiB from a clean start for a 20,079-element `map_task`. I'm also not getting much of that memory back after the workflow completes... hovering at around 16.4GiB 10 minutes later.
t
f
Let me file an issue and look into this potential memory leak. Unsure if it's related to the issue we're already seeing. @wide-lion-54536 we just merged a fix for the finalizers issue into our private fork; hopefully we'll merge that into open source tomorrow. @tall-lock-23197 the ArrayNode map task implementation isn't using that config. Let me go ahead and add that in; unsure if that was intentional or not
🙏 1
r
Not sure if these issues are related, but I've tested our OOM issue on different flyte-binary versions and have not seen a difference between them. I'll try to redeploy with `flyte-core` to get a bit more visibility into which subsystem uses all of that memory.