# ask-the-community
e
Hello everyone 🙂 I’m looking for some higher-level advice on how to structure one of our workflows. We have a workflow that runs a large number (10-100 million) of repetitive tasks, for which the execution logic is very stable. For now, we split the original input list into a more manageable number (10-100) of chunks and use `map_task` to map the chunks onto equally many worker nodes. However, now I would like to add another task into the mix: a ShellTask that would prefetch data into the Flyte filesystem for the worker nodes to use in processing. The reasoning is that the data fetching currently happens inside the processing loop, which creates a sizeable I/O bottleneck. The problem is that, according to the docs, one should not call another task from inside a mapped task. So I’m looking for a more flexible way to distribute processing across multiple pods that would allow calling tasks from inside the worker nodes. I’ve looked into `@dynamic` and subworkflows. Which would be better, or is there a better option? Thanks a ton
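For context, the chunking step described above can be sketched in plain Python; the helper name `chunk` is made up here, and in the real workflow the resulting chunks would be fed to `map_task`:

```python
from typing import List, TypeVar

T = TypeVar("T")

def chunk(items: List[T], n_chunks: int) -> List[List[T]]:
    """Split `items` into at most `n_chunks` contiguous, near-equal chunks."""
    n_chunks = min(n_chunks, len(items)) or 1
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks each take one leftover item.
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks
```

With 10-100 million inputs and 10-100 chunks, each worker node then iterates over roughly `len(items) / n_chunks` items.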
From the docs:
```
When defining a map task, avoid calling other tasks in it. Flyte can't accurately register tasks that call other tasks. While Flyte will correctly execute a task that calls other tasks, it will not be able to give full performance advantages. This is especially true for map tasks.
```
j
Are your tasks sharing the same data, or does each task need different data?
e
Each task has mostly unique data! Although there are some workflow-wide params given to the tasks as well.
k
so is it like
```
map(a -> b)
```
where a and b are individual tasks
`@dynamic` can be done; it all depends on the size of your fanout
map tasks are optimized for large fanout, but if the fanout is < 1k then a dynamic task can probably work
we do plan to make map tasks more powerful in the future
basically, to work on anything, even a subworkflow
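For reference, a minimal sketch of what the `@dynamic` variant could look like, assuming flytekit; the task names (`prefetch`, `process`), their signatures, and the staging path are made up for illustration, not taken from the thread:

```python
from typing import List
from flytekit import task, dynamic

@task
def prefetch(keys: List[str]) -> str:
    # In the real setup this could be a ShellTask invoking the AWS CLI;
    # here it just returns a hypothetical directory the data was staged to.
    return "/tmp/staged"

@task
def process(data_dir: str, keys: List[str]) -> int:
    # Stand-in for the stable per-chunk execution logic.
    return len(keys)

@dynamic
def fan_out(chunks: List[List[str]]) -> List[int]:
    # Unlike a map task, a @dynamic workflow may call tasks freely in a loop:
    # each call becomes its own node in the workflow graph built at runtime.
    results = []
    for chunk in chunks:
        staged = prefetch(keys=chunk)
        results.append(process(data_dir=staged, keys=chunk))
    return results
```

Whether this beats `map_task` depends on the fanout, as noted above: a dynamic workflow registers a node per call, which is fine at 10-100 chunks but heavy at thousands.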
j
If they have mostly unique data, what is the advantage of having it in a separate task? Wouldn't it just punt the bottleneck to the other task?
e
@Ketan (kumare3) thanks for the insights! I'll give dynamic a try
@Jay Ganbat good point! The task to be extracted is fetching data from S3. Right now we are using the Python SDK (boto3), but as I understand it, the process can be optimized using the AWS CLI with multiple connections/batch operations.
As it will need CLI commands, I would rather extract the logic into another task!
j
Unless you share the storage space, it doesn't really work, right? Since each task is its own pod, it will not have the fetched data.
I think you can configure boto in flyte.cfg to enable parallel download and upload.
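On the parallel-download idea: even without CLI tools, many S3 fetches can be overlapped from Python with a thread pool, since the network I/O releases the GIL. A minimal sketch, where `fetch_one` is a hypothetical stand-in for a real downloader (e.g. a wrapper around boto3's `download_file`; boto3 itself also supports concurrent multipart transfers via `TransferConfig(max_concurrency=...)`):

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

def fetch_all(keys: List[str], fetch_one: Callable[[str], bytes],
              max_workers: int = 16) -> List[bytes]:
    """Download many objects concurrently, preserving input order.

    `fetch_one` is assumed to map an object key to its bytes; in a real
    setup it would wrap an S3 client call.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map keeps results in the same order as `keys`.
        return list(pool.map(fetch_one, keys))
```

This is a sketch of the batching pattern, not a drop-in replacement for the CLI's batch operations.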