Hi all! I have a question regarding parallel processing with Flyte.
I have a generator that yields rows of data which I need to process further with Python and then insert into a database table. The amount of data is quite large (approximately 150GB), so I would like to process it in batches.
Specifically, I want the generator to yield, say, a million rows that would be passed to a Flyte task (or any other suitable construct). That task would start a container to process the batch; the batch would then be dropped from memory, and the generator would yield the next million rows for a second container, and so on (roughly as in the sketch below).
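To make that concrete, here is roughly the batching I do today in plain Python. `row_generator`, `process_and_insert`, and the batch size are stand-ins for my real code, not something that runs against an actual table:

```python
from itertools import islice
from typing import Dict, Iterator, List

Row = Dict[str, str]

BATCH_SIZE = 1_000_000


def row_generator() -> Iterator[Row]:
    # Stand-in for my real generator, which lists S3 files and yields their rows.
    yield {"id": "1", "value": "example"}


def batches(rows: Iterator[Row], size: int = BATCH_SIZE) -> Iterator[List[Row]]:
    # Slice the row stream into fixed-size chunks without ever holding more
    # than one chunk in memory.
    while True:
        chunk = list(islice(rows, size))
        if not chunk:
            return
        yield chunk


def process_and_insert(batch: List[Row]) -> None:
    # Stand-in for the per-batch processing + database insert that I would
    # like to run in its own Flyte container.
    pass


def ingest() -> None:
    for batch in batches(row_generator()):
        process_and_insert(batch)
        # On the next iteration this batch is replaced by the next one,
        # so memory stays bounded to roughly one batch at a time.
```

The part I want Flyte to take over is the body of that loop: each chunk should become its own container, while the outer iteration keeps streaming.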
My understanding is that using a dynamic workflow or map_task requires all the data to be loaded in memory before any containers are started, which means the generator has to be fully exhausted first. That defeats the purpose of batch processing. The reason I want to use Flyte this way is to avoid the overhead of converting traditional Python code to PySpark, but it seems hard to achieve with Flyte's current capabilities. Is there something I have missed?
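For reference, this is the kind of @dynamic workflow I could write today (a rough sketch; the task names are hypothetical, and it reuses the generator stub from above). As far as I understand, the dynamic body must run to completion and register every node before Flyte launches the first process_batch container, which is why the generator ends up fully exhausted and every batch serialized up front:

```python
from typing import Dict, Iterator, List

from flytekit import dynamic, task, workflow

Row = Dict[str, str]


def row_generator() -> Iterator[Row]:
    # Stand-in for the real S3-backed generator.
    yield {"id": "1", "value": "example"}


@task
def process_batch(rows: List[Row]) -> int:
    # One container per batch: do the extra Python processing, then insert
    # the rows into the database table.
    return len(rows)


@dynamic
def fan_out_batches(batch_size: int) -> List[int]:
    results = []
    batch: List[Row] = []
    for row in row_generator():
        batch.append(row)
        if len(batch) >= batch_size:
            # This only registers a node; no container starts until the whole
            # dynamic body (and therefore the whole generator) has finished.
            results.append(process_batch(rows=batch))
            batch = []
    if batch:
        results.append(process_batch(rows=batch))
    return results


@workflow
def ingest_wf(batch_size: int = 1_000_000) -> List[int]:
    return fan_out_batches(batch_size=batch_size)
```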
For context, the generator lists a large number of S3 files and yields rows read from them. I know one solution would be to map the file names to the tasks (something like the sketch below), but I would like to avoid rewriting the logic that way, since I use a similar generator construction in many places and would prefer not to restructure everything.
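For completeness, here is a rough sketch of that file-name-mapping workaround; list_s3_keys, process_file, and the bucket prefix are all hypothetical. This is exactly the restructuring I am hoping to avoid:

```python
from typing import List

from flytekit import map_task, task, workflow


@task
def list_s3_keys(prefix: str) -> List[str]:
    # Hypothetical: enumerate the S3 objects under a prefix instead of
    # streaming their rows through a generator.
    return [f"{prefix}/part-{i:05d}.csv" for i in range(3)]


@task
def process_file(key: str) -> int:
    # Hypothetical: read one file, process its rows, insert them into the DB.
    return 0


@workflow
def ingest_by_file(prefix: str = "s3://my-bucket/data") -> List[int]:
    keys = list_s3_keys(prefix=prefix)
    # One mapped container per file key; the full key list is known up front.
    return map_task(process_file)(key=keys)
```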
Is there a way to achieve this batch processing approach with Flyte, where containers start processing batches incrementally without requiring the entire dataset to be loaded upfront? Any advice or suggestions would be greatly appreciated!
Thank you!