What is the "flyte way" to handle workflows with a...
# flyte-support
a
What is the "flyte way" to handle workflows with a lot of data? We have several workflows that handles 10s of GBs, and they're failing due to the output size -
`is too large [28775519] bytes, max allowed [10485760] bytes`
For now, we're passing a FlyteFile instead of the actual data to overcome this issue, and I also understand another approach could be increasing the `max-output-size-bytes` parameter, but this is only temporary, as data in the future could exceed this threshold. So, what is the proper way to handle large I/O?
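For reference, here is a minimal sketch of the FlyteFile workaround described above, assuming flytekit's `FlyteFile` type; the task names and the JSONL record layout are illustrative and not taken from the thread.

```python
# A minimal sketch of the FlyteFile workaround, assuming flytekit is installed;
# task names and the record layout are illustrative.
import json
import os
import tempfile

from flytekit import task, workflow
from flytekit.types.file import FlyteFile


@task
def produce_records() -> FlyteFile:
    # Write the large list of dicts to a local JSONL file instead of returning
    # it directly; Flyte uploads the file to blob storage and passes only a
    # reference between tasks, so the inline output stays small.
    path = os.path.join(tempfile.mkdtemp(), "records.jsonl")
    with open(path, "w") as out:
        for i in range(100_000):
            out.write(json.dumps({"id": i, "value": i * 2}) + "\n")
    return FlyteFile(path)


@task
def consume_records(f: FlyteFile) -> int:
    # The FlyteFile is downloaded lazily; opening it triggers the download.
    with open(f) as fh:
        return sum(1 for _ in fh)


@workflow
def records_wf() -> int:
    return consume_records(f=produce_records())
```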
f
What is this data that you return that is inlined and 10 GB?
Flyte will offload data
Only metadata is passed between tasks
a
Lists of JSONs (dicts).
If so, what could be the reason for the max allowed size exception?
f
Aah yes, JSON is passed inline
How many items are in the list?
We are indeed working on auto-offloading support for large lists, etc.
Cc @high-park-82026 @acceptable-policeman-57188 @thankful-minister-83577
We are also working on a more compact representation of JSON
a
Hundreds of thousands of JSONs... is there maybe a different data type that you support that will be offloaded?
f
We have never seen a JSON list of 10 GB
Yes data frames
Or jsonl
Or file
Or csv
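A minimal sketch of the dataframe option mentioned above, assuming pandas and flytekit; the task names and columns are invented for illustration. Dataframe outputs are offloaded to blob storage rather than passed inline.

```python
# A minimal sketch of returning a dataframe, assuming pandas and flytekit;
# column and task names are illustrative.
import pandas as pd
from flytekit import task, workflow


@task
def produce_frame() -> pd.DataFrame:
    # Dataframes are offloaded: Flyte stores them (e.g. as Parquet) in blob
    # storage and passes only a reference between tasks, so the inline output
    # size limit does not apply to the data itself.
    return pd.DataFrame({"id": [1, 2, 3], "value": [10.0, 20.0, 30.0]})


@task
def count_rows(df: pd.DataFrame) -> int:
    return len(df)


@workflow
def frame_wf() -> int:
    return count_rows(df=produce_frame())
```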
a
I'll try using a dataframe, but I wish it were supported by some data type with type hints... it could really help us (like a list of dicts where I can define the schema)
h
If you are going to go with a dataframe, you can look at StructuredDatasets; they support strongly typed schemas that are compile-time validated... You can even combine it with Pandera to define validation rules and have Flyte automatically kick these off.
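A minimal sketch of a column-typed StructuredDataset, assuming flytekit's structured-dataset support and pandas; the column names and task names are illustrative. The Pandera integration mentioned above lives in a separate plugin (flytekitplugins-pandera) and is not shown here.

```python
# A minimal sketch of a column-typed StructuredDataset; the columns are
# illustrative, not from the thread.
from typing import Annotated

import pandas as pd
from flytekit import kwtypes, task
from flytekit.types.structured import StructuredDataset

# Declare the expected columns; Flyte checks this schema when tasks are
# wired together in a workflow.
UserScores = Annotated[StructuredDataset, kwtypes(user_id=int, score=float)]


@task
def make_scores() -> UserScores:
    df = pd.DataFrame({"user_id": [1, 2], "score": [0.7, 0.9]})
    return StructuredDataset(dataframe=df)


@task
def mean_score(sd: UserScores) -> float:
    # open() converts the offloaded dataset back into a pandas DataFrame.
    df = sd.open(pd.DataFrame).all()
    return float(df["score"].mean())
```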
a
What about Pydantic data models? Is there a place I can see which types are transferred inline and which are offloaded?