What is the "flyte way" to handle workflows with a...
# flyte-support
a
What is the "flyte way" to handle workflows with a lot of data? We have several workflows that handles 10s of GBs, and they're failing due to the output size -
`is too large [28775519] bytes, max allowed [10485760] bytes`
For now, we're passing a FlyteFile instead of the actual data to overcome this issue, and I also understand another approach could be increasing the `max-output-size-bytes` parameter, but this is only temporary, as data in the future could exceed this threshold. So, what is the proper way to handle large I/O?
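For reference, here is a minimal sketch of the FlyteFile workaround described above, assuming flytekit's `FlyteFile` type; the task names and the JSONL record layout are illustrative and not taken from the thread.

```python
# A minimal sketch of the FlyteFile workaround, assuming flytekit is installed;
# task names and the record layout are illustrative.
import json
import os
import tempfile

from flytekit import task, workflow
from flytekit.types.file import FlyteFile


@task
def produce_records() -> FlyteFile:
    # Write the large list of dicts to a local JSONL file instead of returning
    # it directly; Flyte uploads the file to blob storage and passes only a
    # reference between tasks, so the inline output stays small.
    path = os.path.join(tempfile.mkdtemp(), "records.jsonl")
    with open(path, "w") as out:
        for i in range(100_000):
            out.write(json.dumps({"id": i, "value": i * 2}) + "\n")
    return FlyteFile(path)


@task
def consume_records(f: FlyteFile) -> int:
    # The FlyteFile is downloaded lazily; opening it triggers the download.
    with open(f) as fh:
        return sum(1 for _ in fh)


@workflow
def records_wf() -> int:
    return consume_records(f=produce_records())
```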
f
What is this data that you return that is inlined and 10 GB?
Flyte will offload data
Only metadata is passed between tasks
a
Lists of JSONs (dicts).
If so, what could be the reason for the max allowed size exception?
f
Aah yes, JSON is passed inline
How many items are in the list?
We are indeed working on auto-offloading support for large lists, etc.
Cc @high-park-82026 @acceptable-policeman-57188 @thankful-minister-83577
We are also working on a more compact representation of JSON
a
Hundreds of thousands of JSONs... is there maybe a different data type that you support that will be offloaded?
f
We have never seen a JSON list of 10 GB
Yes data frames
Or jsonl
Or file
Or csv
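A minimal sketch of the dataframe option mentioned above, assuming pandas and flytekit; the task names and columns are invented for illustration. Dataframe outputs are offloaded to blob storage rather than passed inline.

```python
# A minimal sketch of returning a dataframe, assuming pandas and flytekit;
# column and task names are illustrative.
import pandas as pd
from flytekit import task, workflow


@task
def produce_frame() -> pd.DataFrame:
    # Dataframes are offloaded: Flyte stores them (e.g. as Parquet) in blob
    # storage and passes only a reference between tasks, so the inline output
    # size limit does not apply to the data itself.
    return pd.DataFrame({"id": [1, 2, 3], "value": [10.0, 20.0, 30.0]})


@task
def count_rows(df: pd.DataFrame) -> int:
    return len(df)


@workflow
def frame_wf() -> int:
    return count_rows(df=produce_frame())
```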
a
I'll try using a dataframe, but I wish it were supported by some data type with type hints... it could really help us (like a list of dicts where I can define the schema)
h
If you are going to go with a dataframe, you can look at StructuredDatasets; they support strongly typed schemas that are compile-time validated... You can even combine it with Pandera to define validation rules and have Flyte automatically kick these off.
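A minimal sketch of a column-typed StructuredDataset, assuming flytekit's structured-dataset support and pandas; the column names and task names are illustrative. The Pandera integration mentioned above lives in a separate plugin (flytekitplugins-pandera) and is not shown here.

```python
# A minimal sketch of a column-typed StructuredDataset; the columns are
# illustrative, not from the thread.
from typing import Annotated

import pandas as pd
from flytekit import kwtypes, task
from flytekit.types.structured import StructuredDataset

# Declare the expected columns; Flyte checks this schema when tasks are
# wired together in a workflow.
UserScores = Annotated[StructuredDataset, kwtypes(user_id=int, score=float)]


@task
def make_scores() -> UserScores:
    df = pd.DataFrame({"user_id": [1, 2], "score": [0.7, 0.9]})
    return StructuredDataset(dataframe=df)


@task
def mean_score(sd: UserScores) -> float:
    # open() converts the offloaded dataset back into a pandas DataFrame.
    df = sd.open(pd.DataFrame).all()
    return float(df["score"].mean())
```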
a
What about Pydantic data models? Is there a place I can see which types are transferred inline and which are offloaded?