# ask-the-community
e
Hello everyone 🙂 I’m looking for some higher-level advice on how to structure one of our workflows. We have a workflow that runs a large number (10-100 million) of repetitive tasks, for which the execution logic is very stable. For now, we split the original input list into a more manageable number (10-100) of chunks and use `map_task` to map the chunks onto equally many worker nodes. However, now I would like to add another task into the mix: a ShellTask that would prefetch data into the Flyte filesystem for the worker nodes to use in processing. The reasoning is that the data fetching currently happens inside the processing loop, which creates a sizeable I/O bottleneck. The problem is that, according to the docs, one should not call another task from inside a mapped task. So I’m looking for a more flexible way to distribute processing across multiple pods that would allow calling tasks from inside the worker nodes. I’ve looked into `@dynamic` and subworkflows. Which would be better, or is there a better option? Thanks a ton
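For context, the chunking step described above can be sketched in plain Python; the helper name `chunk` is made up here, and in the real workflow the resulting chunks would be fed to `map_task`:

```python
from typing import List, TypeVar

T = TypeVar("T")

def chunk(items: List[T], n_chunks: int) -> List[List[T]]:
    """Split `items` into at most `n_chunks` contiguous, near-equal chunks."""
    n_chunks = min(n_chunks, len(items)) or 1
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks each take one leftover item.
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks
```

With 10-100 million inputs and 10-100 chunks, each worker node then iterates over roughly `len(items) / n_chunks` items.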
From the docs:
```
When defining a map task, avoid calling other tasks in it. Flyte can't accurately register tasks that call other tasks. While Flyte will correctly execute a task that calls other tasks, it will not be able to give full performance advantages. This is especially true for map tasks.
```
j
Are your tasks sharing the same data, or does each task need different data?
e
Each task has mostly unique data! Although there are some workflow-wide params given to the tasks as well.
k
so is it like
```
map(a -> b)
```
where a and b are individual tasks
`@dynamic` can be done; it all depends on the size of your fanout
map tasks are optimized for large fanout, but if the fanout is < 1k then a dynamic task can probably work
we do plan to make map tasks more powerful in the future
basically, to work on anything, even a subworkflow
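For reference, a minimal sketch of what the `@dynamic` variant could look like, assuming flytekit; the task names (`prefetch`, `process`), their signatures, and the staging path are made up for illustration, not taken from the thread:

```python
from typing import List
from flytekit import task, dynamic

@task
def prefetch(keys: List[str]) -> str:
    # In the real setup this could be a ShellTask invoking the AWS CLI;
    # here it just returns a hypothetical directory the data was staged to.
    return "/tmp/staged"

@task
def process(data_dir: str, keys: List[str]) -> int:
    # Stand-in for the stable per-chunk execution logic.
    return len(keys)

@dynamic
def fan_out(chunks: List[List[str]]) -> List[int]:
    # Unlike a map task, a @dynamic workflow may call tasks freely in a loop:
    # each call becomes its own node in the workflow graph built at runtime.
    results = []
    for chunk in chunks:
        staged = prefetch(keys=chunk)
        results.append(process(data_dir=staged, keys=chunk))
    return results
```

Whether this beats `map_task` depends on the fanout, as noted above: a dynamic workflow registers a node per call, which is fine at 10-100 chunks but heavy at thousands.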
j
If they have mostly unique data, what is the advantage of having it in a separate task? Wouldn't it just punt the bottleneck to the other task?
e
@Ketan (kumare3) thanks for the insights! I'll give dynamic a try
@Jay Ganbat good point! The task to be extracted is fetching data from S3. Right now we are using the Python SDK (boto3), but as I understand it, the process can be optimized using the AWS CLI with multiple connections/batch operations.
As it will need CLI commands, I would rather extract the logic into another task!
j
Unless you share the storage space, it doesn't really work, right? Since each task is its own pod, it will not have the fetched data.
I think you can configure boto in flyte.cfg to enable parallel download and upload.
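On the parallel-download idea: even without CLI tools, many S3 fetches can be overlapped from Python with a thread pool, since the network I/O releases the GIL. A minimal sketch, where `fetch_one` is a hypothetical stand-in for a real downloader (e.g. a wrapper around boto3's `download_file`; boto3 itself also supports concurrent multipart transfers via `TransferConfig(max_concurrency=...)`):

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

def fetch_all(keys: List[str], fetch_one: Callable[[str], bytes],
              max_workers: int = 16) -> List[bytes]:
    """Download many objects concurrently, preserving input order.

    `fetch_one` is assumed to map an object key to its bytes; in a real
    setup it would wrap an S3 client call.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map keeps results in the same order as `keys`.
        return list(pool.map(fetch_one, keys))
```

This is a sketch of the batching pattern, not a drop-in replacement for the CLI's batch operations.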