wide-lion-54536
08/16/2024, 3:28 PM
Running `map_task` with progressively larger batches. In the graphs below, the first ramp is ~1,000 elements, the second is ~10,000, and the current is ~20,000. The task is straightforward: it sleeps between 0.25 and 0.5 seconds and returns a dataclass. The memory usage seems disproportionate to the job.

wide-lion-54536
08/16/2024, 3:30 PM
`cr.flyte.org/flyteorg/flyte-binary-release:v1.13.0` image running via the flyte-binary-v1.13.0 chart.

wide-lion-54536
08/16/2024, 3:33 PM
```python
import random
from dataclasses import dataclass
from time import sleep
from typing import Optional

from flytekit import map_task, task, workflow
from mashumaro.mixins.json import DataClassJSONMixin


@dataclass
class Result(DataClassJSONMixin):
    ok: bool
    msg: str
    meta: dict[str, str | int | bool]


@task
def do_work(id: int) -> Optional[Result]:
    sleep(random.uniform(0.25, 0.5))
    if id % 23 == 0:
        # simulated error
        return None
    return Result(
        ok=True, msg=f"Hello, {id}", meta={"foo": "bar", "baz": id, "qux": True}
    )


@workflow
def do_flat_fanout(n: int = 1000) -> float:
    # generate_ids, filter_results, and reduce_results are tasks defined elsewhere
    ids = generate_ids(n=n)
    results = map_task(do_work, concurrency=50)(id=ids)
    fr = filter_results(results=results)
    return reduce_results(results=fr)
```
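(The helper tasks referenced in `do_flat_fanout` aren't shown in the thread; below is a minimal sketch of what they could look like. The bodies are assumptions for context, not the poster's actual code.)

```python
from typing import Optional

from flytekit import task

# Result is the dataclass from the snippet above; these bodies are hypothetical.


@task
def generate_ids(n: int) -> list[int]:
    # Assumed: produce the ids that do_work fans out over.
    return list(range(n))


@task
def filter_results(results: list[Optional[Result]]) -> list[Result]:
    # Assumed: drop the None entries returned for the simulated errors.
    return [r for r in results if r is not None]


@task
def reduce_results(results: list[Result]) -> float:
    # Assumed: collapse the filtered results into one summary number.
    return float(len(results))
```
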
wide-lion-54536
08/16/2024, 3:35 PM
`pyflyte run --remote odyssey_data/flyte/examples/fanout.py do_flat_fanout --n 20037`

average-finland-92144
08/16/2024, 4:13 PM

wide-lion-54536
08/16/2024, 4:35 PM
Looking at the logs in the `flyte` namespace, I don't see much during the time span in the graphs above until the pod restarted at 08:23.
```
Common labels: {"app":"flyte-binary","component":"flyte-binary","container":"flyte","filename":"/var/log/pods/flyte_flyte-binary-75855f5745-wjw9f_71d7974a-1eac-4af6-a8d6-fc9b4d09ab7c/flyte/0.log","instance":"flyte-binary","job":"flyte/flyte-binary","namespace":"flyte","node_name":"aks-cpularge-32799272-vmss000018","pod":"flyte-binary-75855f5745-wjw9f","stream":"stderr"}
Line limit: "2000 (7 displayed)"
Total bytes processed: "1.67 kB"
2024-08-16 07:01:23.963 W0816 14:01:23.963802 7 reflector.go:458] pkg/mod/k8s.io/client-go@v0.28.2/tools/cache/reflector.go:229: watch of *v1.Pod ended with: an error on the server ("unable to decode an event from the watch stream: got short buffer with n=0, base=4092, cap=40960") has prevented the request from succeeding
2024-08-16 07:03:56.197 W0816 14:03:56.196876 7 reflector.go:458] pkg/mod/k8s.io/client-go@v0.28.2/tools/cache/reflector.go:229: watch of *v1.Pod ended with: an error on the server ("unable to decode an event from the watch stream: got short buffer with n=0, base=4092, cap=20480") has prevented the request from succeeding
2024-08-16 07:44:58.647 W0816 14:44:58.646960 7 reflector.go:458] pkg/mod/k8s.io/client-go@v0.28.2/tools/cache/reflector.go:229: watch of *v1.Pod ended with: an error on the server ("unable to decode an event from the watch stream: got short buffer with n=0, base=4092, cap=20480") has prevented the request from succeeding
2024-08-16 08:14:30.538 W0816 15:14:30.538669 7 reflector.go:458] pkg/mod/k8s.io/client-go@v0.28.2/tools/cache/reflector.go:229: watch of *v1.Pod ended with: an error on the server ("unable to decode an event from the watch stream: got short buffer with n=0, base=4092, cap=20480") has prevented the request from succeeding
2024-08-16 08:23:27.170 E0816 15:23:27.170185 7 workers.go:103] error syncing 'flytesnacks-development/f675591f52ec2476baf4': worker error(s) encountered: [5]: 0:
2024-08-16 08:23:27.170 Operation cannot be fulfilled on pods "f675591f52ec2476baf4-n1-0-n12071-0": the object has been modified; please apply your changes to the latest version and try again
2024-08-16 08:23:27.170
```

average-finland-92144
08/16/2024, 4:39 PM
Are you running with `inject-finalizers: true`? We've seen this same error message, `the object has been modified; please apply your changes to the latest version and try again`, in situations of high concurrency. While it's still under investigation, disabling finalizers seems to help as a temporary measure.
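(For reference, a sketch of what disabling that flag could look like in the flyte-binary chart values, assuming the propeller config is supplied through the chart's `configuration.inline` block, which matches the key path quoted below; this is an illustration, not a snippet from this deployment.)

```yaml
# Hypothetical flyte-binary Helm values fragment; the nesting under
# configuration.inline is an assumption, not taken from this deployment.
configuration:
  inline:
    plugins:
      k8s:
        inject-finalizer: false
```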

wide-lion-54536
08/16/2024, 4:42 PM
`inline.plugins.k8s.inject-finalizer: true`
average-finland-92144
08/16/2024, 4:43 PM
Can you set it to `false` and test again?

wide-lion-54536
08/16/2024, 4:44 PM
Is `concurrency=50` considered high?

average-finland-92144
08/16/2024, 4:45 PM

wide-lion-54536
08/16/2024, 6:38 PM
Disabled `inject-finalizer` (thanks!), but I'm still looking at what seems to be excessive memory pressure in flyte-binary, peaking at 19.5GiB from a clean start for a 20,079-element `map_task`.
I'm also not getting much of that memory back after the workflow completes... hovering at around 16.4GiB 10 minutes later.

tall-lock-23197
flat-area-42876
08/20/2024, 3:24 AM

ripe-smartphone-56353
08/20/2024, 1:18 PM
Maybe try the `flyte-core` chart to get a bit more visibility into which subsystem uses all of that memory.