# flyte-support
w
Perhaps related: I'm testing fan-out using `map_task` with progressively larger batches. In the graphs below, the first ramp is ~1,000 elements, the second is ~10,000, and the current one is ~20,000. The task is straightforward: it sleeps between 0.25 and 0.5 seconds and returns a dataclass. The memory usage seems disproportionate to the job.
Currently running the `cr.flyte.org/flyteorg/flyte-binary-release:v1.13.0` image via the `flyte-binary-v1.13.0` chart.
```python
@dataclass
class Result(DataClassJSONMixin):
    ok: bool
    msg: str
    meta: dict[str, str | int | bool]


@task
def do_work(id: int) -> Optional[Result]:

    sleep(random.uniform(0.25, 0.5))
    if id % 23 == 0:
        # simulated error
        return None
    return Result(
        ok=True, msg=f"Hello, {id}", meta={"foo": "bar", "baz": id, "qux": True}
    )

@workflow
def do_flat_fanout(n: int = 1000) -> float:
    ids = generate_ids(n=n)
    results = map_task(do_work, concurrency=50)(id=ids)
    fr = filter_results(results=results)
    return reduce_results(results=fr)
```
(Imports and the filter/reduce tasks are elided; a sketch of them follows the launch command.) Launched with:
```shell
pyflyte run --remote odyssey_data/flyte/examples/fanout.py do_flat_fanout --n 20037
```
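For reference, a minimal sketch of what the elided helpers could look like, inferred from the workflow signatures above (the bodies here are assumptions, not the code actually run):
```python
from typing import Optional

from flytekit import task

# Result is the dataclass defined earlier in this thread.


@task
def generate_ids(n: int) -> list[int]:
    # Produce the fan-out inputs for the map_task.
    return list(range(n))


@task
def filter_results(results: list[Optional[Result]]) -> list[Result]:
    # Drop the simulated failures (None entries) before reducing.
    return [r for r in results if r is not None]


@task
def reduce_results(results: list[Result]) -> float:
    # Example reduction: count of successful results.
    return float(sum(1 for r in results if r.ok))
```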
a
@wide-lion-54536 could you take a look at the flyteadmin logs? There's an issue under investigation on 1.13 that drives OOM on the backend pods; it can be detected as hundreds of calls to a couple of gRPC methods
👀 1
w
I'm looking at all logs from the `flyte` namespace and don't see much during the time span in the graphs above until the pod restarted at 08:23.
```
Common labels: {"app":"flyte-binary","component":"flyte-binary","container":"flyte","filename":"/var/log/pods/flyte_flyte-binary-75855f5745-wjw9f_71d7974a-1eac-4af6-a8d6-fc9b4d09ab7c/flyte/0.log","instance":"flyte-binary","job":"flyte/flyte-binary","namespace":"flyte","node_name":"aks-cpularge-32799272-vmss000018","pod":"flyte-binary-75855f5745-wjw9f","stream":"stderr"}
Line limit: "2000 (7 displayed)"
Total bytes processed: "1.67  kB"


2024-08-16 07:01:23.963	W0816 14:01:23.963802       7 reflector.go:458] pkg/mod/k8s.io/client-go@v0.28.2/tools/cache/reflector.go:229: watch of *v1.Pod ended with: an error on the server ("unable to decode an event from the watch stream: got short buffer with n=0, base=4092, cap=40960") has prevented the request from succeeding
2024-08-16 07:03:56.197	W0816 14:03:56.196876       7 reflector.go:458] pkg/mod/k8s.io/client-go@v0.28.2/tools/cache/reflector.go:229: watch of *v1.Pod ended with: an error on the server ("unable to decode an event from the watch stream: got short buffer with n=0, base=4092, cap=20480") has prevented the request from succeeding
2024-08-16 07:44:58.647	W0816 14:44:58.646960       7 reflector.go:458] pkg/mod/k8s.io/client-go@v0.28.2/tools/cache/reflector.go:229: watch of *v1.Pod ended with: an error on the server ("unable to decode an event from the watch stream: got short buffer with n=0, base=4092, cap=20480") has prevented the request from succeeding
2024-08-16 08:14:30.538	W0816 15:14:30.538669       7 reflector.go:458] pkg/mod/k8s.io/client-go@v0.28.2/tools/cache/reflector.go:229: watch of *v1.Pod ended with: an error on the server ("unable to decode an event from the watch stream: got short buffer with n=0, base=4092, cap=20480") has prevented the request from succeeding
2024-08-16 08:23:27.170	E0816 15:23:27.170185       7 workers.go:103] error syncing 'flytesnacks-development/f675591f52ec2476baf4': worker error(s) encountered: [5]: 0: 
2024-08-16 08:23:27.170	Operation cannot be fulfilled on pods "f675591f52ec2476baf4-n1-0-n12071-0": the object has been modified; please apply your changes to the latest version and try again
2024-08-16 08:23:27.170
```
a
do you know if you have set `inject-finalizer: true`? We've seen this same error message (`the object has been modified; please apply your changes to the latest version and try again`) in situations of high concurrency. While it's still under investigation, disabling finalizers seems to help as a temporary measure
w
Yep, it's set: `inline.plugins.k8s.inject-finalizer: true`
a
could you try setting it to `false` and test again?
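A minimal sketch of that override as Helm values for the flyte-binary chart, assuming the setting lives under the chart's `configuration.inline` block as in the key quoted above (adjust to your own values layout):
```yaml
# Sketch: disable finalizer injection via the flyte-binary chart values.
# The configuration.inline path is an assumption based on the
# inline.plugins.k8s.inject-finalizer setting mentioned in this thread.
configuration:
  inline:
    plugins:
      k8s:
        inject-finalizer: false
```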
w
Will do, thanks for the quick help here! Curious, is `concurrency=50` considered high?
a
not at all
w
That completed successfully after disabling `inject-finalizer` (thanks!), but I'm still looking at what seems to be excessive memory pressure in flyte-binary, peaking at 19.5GiB from a clean start for a 20,079-element `map_task`. I'm also not getting much of that memory back after the workflow completes... hovering at around 16.4GiB 10 minutes later.
t
f
Let me file an issue and look into this potential memory leak. Unsure if it's related to the issue we're already seeing. @wide-lion-54536 we just merged a fix for the finalizers issue into our private fork; hopefully we'll merge that into open source tomorrow. @tall-lock-23197 the ArrayNode map task implementation isn't using that config. Let me go ahead and add that in; unsure if that was intentional or not
🙏 1
r
Not sure if these issues are related, but I've tested our OOM issue on different flyte-binary versions and have not seen a difference between them. I'll try to redeploy with `flyte-core` to get a bit more visibility into which subsystem uses all of that memory.