# ask-the-community
a
I have a use case that requires very fine-grained caching, and I was wondering if a dynamic workflow spawning thousands of tasks is okay?
• I have a pandas dataframe of 50k rows; each row contains a sentence that I want to apply expensive operations to (think of passing each sentence through an external LLM service).
• Across my experiments the order and contents of the dataframe can change, but I still want cache hits on the subset of sentences that have already been seen (for example, I have to shuffle and randomly split my dataset for validation).
Any ideas?
f
Hey! Why not a map task for this?
a
Interesting 🤔
• Do map tasks also support task-level caching?
• Are they able to support thousands of tasks without blowing up the graph?
f
I don't know about the second one; see https://flyte.org/blog/map-tasks-in-flyte. Caching works for them.
s
> Are they able to support thousands of tasks without blowing up the graph?
I believe so!
a
Just tested it out with 2k nodes and it worked perfectly 👍 gonna try out with 50k today 🤞