# flyte-support
w
Hi, I just experienced this error:
Copy code
failed at Node[n4]. BindingResolutionError: Error binding Var [wf].[inp], caused by: failed at Node[n2]. CausedByError: Failed to GetPrevious data from outputDir [<s3://flytemeta/metadata/propeller/ohli-training-development-a9czrgtghc5dzmxhdflf/n2/data/0/outputs.pb>], caused by: path:<s3://flytemeta/metadata/propeller/ohli-training-development-a9czrgtghc5dzmxhdflf/n2/data/0/outputs.pb>: [LIMIT_EXCEEDED] limit exceeded. 8mb > 2mb.
It looks to me as if the output (or actually the input, since dumping the output to S3 worked fine) using native Python literals cannot exceed 2MB after serialization, is that correct? I am trying to pass a list of thousands of paths from one task to another, which apparently takes about 9MB. Is it possible to increase this limit, or would the best practice in such a scenario be to use a Dataframe? Thanks
a
We’ve run into this before - we’ve either used a dataframe or just dumped it into a flat file and passed that around as a Flyte file
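To make the "flat file" suggestion concrete, here is a minimal sketch (task names, the bucket, and the file layout are illustrative, not from this thread): the list of paths is written to a local file and returned as a FlyteFile, so only a blobstore reference travels through Flyte's inter-task metadata instead of the multi-megabyte literal.
Copy code
import os
import tempfile
from typing import List

from flytekit import task, workflow
from flytekit.types.file import FlyteFile


@task
def collect_paths() -> FlyteFile:
    # Hypothetical producer: thousands of paths that would blow past the
    # metadata size limit if returned as a native List[str].
    paths: List[str] = [f"s3://my-bucket/item-{i}.parquet" for i in range(100_000)]
    out = os.path.join(tempfile.mkdtemp(), "paths.txt")
    with open(out, "w") as f:
        f.write("\n".join(paths))
    # Returning a FlyteFile offloads the content to the blobstore.
    return FlyteFile(path=out)


@task
def count_paths(paths_file: FlyteFile) -> int:
    # FlyteFile is os.PathLike; open() triggers the download inside the task.
    with open(paths_file) as f:
        return sum(1 for _ in f)


@workflow
def paths_wf() -> int:
    return count_paths(paths_file=collect_paths())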
h
There is also a configuration value, max-output-size-bytes, at the root of the propeller configuration.
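As a hedged sketch of what that could look like in a rendered propeller config (the exact nesting and default depend on your chart version and values; the 10MB figure is an assumption based on the defaults mentioned later in this thread):
Copy code
propeller:
  # ... other propeller settings ...
  max-output-size-bytes: 10485760  # assumed ~10MB default; the old default was 2MB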
f
We are working on making this smarter and automatically offloading the metadata when it is large
w
Thanks a lot for the quick replies! Will try to increase this limit - for even bigger data, a dataframe sounds like a good alternative
m
I'm running into this issue as well. I don't understand why my results are as large as Flyte reports (limit_exceeded. 22mb > 2mb). My dynamic workflow is calling a "combine" task with a list of dataclass-based result objects which were returned by individual tasks called while looping over inputs. My questions are:
• How can I programmatically measure the size of the result objects myself? Presumably this is the size of the result object once serialized for passing between nodes in the Flyte DAG. My result objects contain a custom field that allows me to lazily load a Dataframe on demand, but this Dataframe shouldn't be included in the result size; only the filename is stored. Can I serialize a result object to disk (via marshmallow? or?) and then look at the file size to understand the size of my results as seen by Flyte? (See the sketch after this list.)
• @hallowed-mouse-14616 indicates this problem can also be addressed by increasing the max-output-size-bytes value. But the file referenced seems to indicate a 10mb default, where my error message (and the OP above) indicates 2mb as the limit, so is this in fact the correct value to change?
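One rough way to answer the first bullet: let flytekit convert the value into a Flyte Literal the same way it does when passing data between tasks, then measure the serialized protobuf. This is a sketch, not a supported API (TypeEngine and FlyteContextManager are flytekit internals and may change between versions), and MyResult below stands in for the dataclass-based result type.
Copy code
from typing import List

from flytekit.core.context_manager import FlyteContextManager
from flytekit.core.type_engine import TypeEngine


def literal_size_bytes(value, python_type) -> int:
    # Convert the Python value to a Flyte Literal, then measure the
    # serialized protobuf, which is approximately what propeller stores.
    ctx = FlyteContextManager.current_context()
    lit = TypeEngine.to_literal(ctx, value, python_type, TypeEngine.to_literal_type(python_type))
    return len(lit.to_flyte_idl().SerializeToString())


# e.g. for results: List[MyResult] returned by the per-input tasks:
# print(literal_size_bytes(results, List[MyResult]))
print(literal_size_bytes([f"s3://bucket/chunk-{i}" for i in range(100_000)], List[str]))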
Maybe my case is slightly different and I should post to a new thread. Mine is not failing to GetPrevious due to output size; mine is failing to read the futures file, but also because of this same size limit. Is this saying that the size of all results being passed out of node n1 is too large? I'm not sure how to interpret this message, or how to confirm that the results are too large.
Copy code
Workflow[io-erisyon-people-thomas:development:plaster.genv2.generators.survey.survey_workflow] failed. RuntimeExecutionError: max number of system retry attempts [11/10] exhausted. Last known status message: [system] unable to read futures file, maybe corrupted, caused by: [system] Failed to read futures protobuf file., caused by: path:<s3://informatics-flytes3bucket-2133a04/metadata/propeller/io-erisyon-people-thomas-development-f3a22869c5041400ca49/n1/data/0/futures.pb>: [LIMIT_EXCEEDED] limit exceeded. 24mb > 2mb.
This error message occurs almost immediately after node n1 begins running (it is a dynamic workflow) -- it creates a number of tasks, and will eventually pass the results of those tasks to the combine task. I've been guessing that it is this combination of results that is too large, but I tried changing to pass only the path to the results and, strangely, got the same size error (with the same 24mb > 2mb), so I may be wrong.
h
@microscopic-furniture-57275 the futures.pb file in a dynamic workflow is the compiled workflow. So this error may not have anything to do with the result sizes passed between tasks, but rather with the overall size of the dynamic. Do you know how many nodes are included in the dynamic workflow? There is a different configuration value for reading this data, hence the 2MB limit. I think it is set as maxDownloadMBs (the example has 10MB, but please check your configuration).
m
Hi @hallowed-mouse-14616, thanks for your response. The dynamic in question was running perhaps 36 nodes -- 4 tasks each for 9 permutations of input, with the goal of gathering these results in a separate task once those are all complete. But this is actually a small example and can/will get considerably larger in the future (more permutations of input), so maybe this is not the best architecture for this type of job. Either way, I'd like to understand more fully what the culprit is.
h
So the futures.pb file is stored in the blobstore. Basically, in a dynamic workflow flytekit will execute the task and compile the workflow dynamically; propeller then notices the existence of a futures.pb file and continues by executing the subnodes of the dynamic. In my local example (below):
Copy code
from flytekit import dynamic, task, workflow


@task
def square(n: int) -> int:
    return n * n


@dynamic
def dynamic_parallel_square(a: int, count: int) -> int:
    # Fan out `count` square tasks; their results are not consumed here,
    # the dynamic just spawns the subnodes and returns its input.
    for i in range(count):
        square(n=a)
    return a


@workflow
def dynamic_parallel_square_wf(a: int, count: int) -> int:
    return dynamic_parallel_square(a=a, count=count)
I execute on a local cluster with
Copy code
(.venv) hamersaw@ragnarok:~/Development/flytetest$ pyflyte run --remote dynamic.py dynamic_parallel_square_wf --a=2 --count=3
Running Execution on [Remote]. ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
[✔] Go to <http://localhost:30081/console/projects/flytesnacks/domains/development/executions/f79eae60587bb4140897> to see execution in the console.
The location of the futures.pb file is my-s3-bucket/metadata/propeller/flytesnacks-development-f79eae60587bb4140897/n0/data/0/futures.pb in my minio - read as the flytesnacks-development-f... execution and data for the first attempt on node n0. I read this locally and it's only 1.1KB. You can use the flyte-cli utility (it should be installed with pyflyte, but will be deprecated relatively soon) to parse the protobuf file and print JSON with a command similar to:
Copy code
(.venv) hamersaw@ragnarok:~/Development/flytetest$ flyte-cli parse-proto --filename ~/Downloads/futures.pb --proto_class flyteidl.core.dynamic_job_pb2.DynamicJobSpec
This uses flyteidl to print the DynamicJobSpec located here. This should help identify exactly what is being stored for the dynamic.
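Since flyte-cli is slated for deprecation, the same file can also be inspected directly with the flyteidl Python package. A minimal sketch, assuming the futures.pb has already been downloaded locally (e.g. from the minio console or with aws s3 cp):
Copy code
# Sketch: parse a locally downloaded futures.pb with flyteidl, as an
# alternative to `flyte-cli parse-proto`.
from flyteidl.core.dynamic_job_pb2 import DynamicJobSpec
from google.protobuf import json_format

with open("futures.pb", "rb") as f:
    spec = DynamicJobSpec()
    spec.ParseFromString(f.read())

# How many subnodes/tasks the dynamic compiled into, plus the full spec as JSON.
print(len(spec.nodes), "nodes,", len(spec.tasks), "tasks")
print(json_format.MessageToJson(spec))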
m
Hey @hallowed-mouse-14616, thanks so much for this very detailed reply, it is immensely helpful. I've managed to locate the futures.pb file for my own failing workflow and examine it with the flyte-cli tool you suggested, and found immediately that it contains full copies of the input data to each task created during the dynamic; these contain huge amounts of biological sequence data -- so I'll think about how to spec this differently.
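One possible re-spec, sketched with illustrative names (analyze, fan_out, and the FlyteFile inputs are assumptions, not from this thread): hand each subtask a blobstore reference such as a FlyteFile rather than the raw sequence data, so the compiled dynamic in futures.pb only records small bindings.
Copy code
from typing import List

from flytekit import dynamic, task
from flytekit.types.file import FlyteFile


@task
def analyze(seq_file: FlyteFile) -> float:
    # FlyteFile is os.PathLike; open() triggers the download inside the task.
    with open(seq_file) as f:
        return float(len(f.read()))


@dynamic
def fan_out(seq_files: List[FlyteFile]) -> List[float]:
    # Each binding recorded in futures.pb is just a file reference,
    # not the sequence payload itself.
    return [analyze(seq_file=f) for f in seq_files]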
As for the limit being reported, I'm still not sure why the error message says [LIMIT_EXCEEDED] limit exceeded. 24mb > 2mb. We are not changing much of anything from the flyte helm-chart defaults, and per the link you provided, this value would seem to default to 10mb. But a search on this error message points to the same config option you've referenced.
h
I know the old default configuration was 2mb. It is very possible that is the configuration in your deployment.
m
Thanks, I'll look more closely. I didn't do the deployment, and it's unclear to me whether there is a single yaml file somewhere that reflects ALL of the actual values used in the deployment. Instead, I have a TypeScript file used by Pulumi that overrides only a couple of values from the default helm charts, and our last installation was of version 1.9.1 a week or two ago, so it's quite recent.
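One way to see what propeller is actually running with, regardless of how the chart was rendered, is to read the ConfigMap in the cluster. A sketch, assuming a flyte-core-style install (the namespace and ConfigMap name below are the usual defaults and may differ in your deployment):
Copy code
# Dump the rendered propeller configuration and look for the size limit
# settings (they may not appear if left at the built-in defaults).
kubectl -n flyte get configmap flyte-propeller-config -o yaml | grep -iE 'max-output-size-bytes|max-download'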