# ask-the-community
s
Hi, is there some documentation on running a part of a workflow? If there are 5 tasks in a workflow, for example, and I’d like to run the first 3 tasks to view the output of task #3, how could I do that? Also, where are the intermediate outputs for each task saved? I’m trying to visualize the intermediate task outputs, and I’m not sure it’ll be performant enough to be near real-time for large datasets.
y
so intermediate outputs are saved in the same place that workflow inputs/outputs are saved. there might be a bit of additional handling for workflows done by the control plane (@Katrina P can confirm) but for the most part it’s the same - offloaded data types like files and dataframes are stored in the bucket of your choice, and primitive inputs/outputs (along with pointers to those offloaded literals) are stored in Flyte’s s3 bucket as metadata.
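To make that concrete, here’s a minimal sketch (the task name and values are made up, not from this thread) of a task whose two outputs take the two different paths: the primitive goes into Flyte’s metadata store, while the dataframe is offloaded to the configured data bucket with only a pointer kept in metadata.

```python
# Minimal sketch, not from this thread: one primitive output and one offloaded output.
from typing import Tuple

import pandas as pd
from flytekit import task, workflow


@task
def summarize() -> Tuple[int, pd.DataFrame]:
    df = pd.DataFrame({"id": [1, 2, 3], "value": [0.1, 0.2, 0.3]})
    # The int is stored as a primitive literal in Flyte's metadata bucket;
    # the DataFrame is offloaded to your configured blob store, and only a
    # pointer to it lands in the metadata.
    return len(df), df


@workflow
def wf() -> Tuple[int, pd.DataFrame]:
    return summarize()
```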
there’s currently no way to tell the system to just stop in the middle of a workflow, unless it fails. are you looking for a breakpoint or something like that?
it should be easy enough however to register a new workflow with only the first three tasks.
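For instance (a hedged sketch; `task_1`..`task_3`, their signatures, and the `my_project.tasks` module are placeholders for your own tasks), you can compose a second workflow out of the same task functions and register just that:

```python
# Sketch: reuse the existing task functions in a shorter workflow that stops
# after the third task, so task 3's output becomes the workflow output.
import pandas as pd
from flytekit import workflow

from my_project.tasks import task_1, task_2, task_3  # hypothetical module


@workflow
def first_three_wf(raw_path: str) -> pd.DataFrame:
    a = task_1(path=raw_path)
    b = task_2(data=a)
    # Stop here: whatever task 3 returns is surfaced as the workflow output.
    return task_3(data=b)
```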
run results can be fetched later from admin or through the Python `FlyteRemote` object
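A rough sketch of that (the execution name, project, domain, and node id below are placeholders):

```python
# Sketch: fetch a past execution and read its outputs with FlyteRemote.
from flytekit.configuration import Config
from flytekit.remote import FlyteRemote

remote = FlyteRemote(
    config=Config.auto(),           # picks up your flytectl/flyte config
    default_project="flytesnacks",
    default_domain="development",
)

execution = remote.fetch_execution(name="f1234abcd")
execution = remote.sync(execution, sync_nodes=True)

# Workflow-level outputs of a finished run.
print(execution.outputs)

# Node-level (intermediate) outputs; node ids ("n0", "n1", ...) are visible
# in the Flyte console.
print(execution.node_executions["n2"].outputs)
```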
what do you mean by near-real-time?
s
OK so here’s what I’m trying to achieve: imagine trying to visualize the output data of task #3 out of a total of 5 tasks. Then I change the code in task #3, which requires rerunning the workflow up to task #3. If the intermediate data for tasks #1 and #2 is not in memory, it would take a while to rerun the whole workflow up to task #3 just to get that data visualized.
So my use case is mainly data exploration inside a workflow.
Changing the data processing and getting real-time visualization of that data.
In intermediate steps of the workflow
Hope that makes sense?
y
i see
when you say re-running, you mean remotely on a cluster? or locally on your laptop?
in both instances, we support data memoization right? so if tasks 1 and 2 haven’t changed, then they are not re-run.
however you cannot share caches between laptop and the flyte backend, not yet anyways
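As a sketch of what enabling that memoization looks like (task names and bodies here are made up): tasks 1 and 2 are marked cacheable, so when only task 3’s code changes, a re-run reuses their previous outputs instead of executing them again.

```python
# Sketch: enabling memoization on the upstream tasks. cache_version is bumped
# whenever a task's logic changes; otherwise a matching prior result is reused.
import pandas as pd
from flytekit import task, workflow


@task(cache=True, cache_version="1.0")
def load() -> pd.DataFrame:
    return pd.DataFrame({"x": range(1000)})


@task(cache=True, cache_version="1.0")
def clean(df: pd.DataFrame) -> pd.DataFrame:
    return df.dropna()


@task  # the task you're iterating on; left uncached while exploring
def explore(df: pd.DataFrame) -> pd.DataFrame:
    return df.describe()


@workflow
def wf() -> pd.DataFrame:
    return explore(df=clean(df=load()))
```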
s
Ah ok - I’m sorry if I’m being ignorant here, but I was just wondering where the data is cached if each task runs in a separate container (or am I getting this wrong)? If the task outputs are cached on disk, wouldn’t it be much slower to process than keeping the data in memory?
By the way, I’m talking about running the workflow remotely in a Kubernetes cluster
Thanks for answering my questions by the way! @Yee
y
oh no… they’re cached in s3
basically datacatalog is a pointer service, and propeller will check there first before running a cacheable task
and yeah ofc.
s
OK so we do need to bring the data over the network from s3 - hmm ok
y
we’re a little async sometimes, so just ping again if you’re in a hurry
and yes
there are performance enhancements we can do for sure
what’s the size of the data we’re talking about?
s
Millions of rows?
y
want to do a quick chat early next week?
can talk about the use-case a bit more specifically.
s
Let me do a bit more research here and get back to you
Still have to figure out how to build this out
I’ll play around with it a little and report back
y
sure of course.
👍 1
just want to understand the use case more.
k
@seunggs it seems you want to build a kind of debugger for Flyte, where you can get the data from any partial task locally and then re-run it with code modifications?
this should be completely doable, and we’ve wanted to do this for a while 🙂
UX is the most important thing
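A rough sketch of that flow, under some assumptions (the execution name, node id `n1`, output key `o0`, and the `my_project.tasks` module are all placeholders): fetch the upstream task’s output from a past remote run, then call the edited task function locally on it.

```python
# Sketch of the "partial re-run" flow: pull task 2's output from a previous
# remote execution, then run the edited task 3 locally on that data.
import pandas as pd
from flytekit.configuration import Config
from flytekit.remote import FlyteRemote

from my_project.tasks import task_3  # the task whose code you just changed

remote = FlyteRemote(
    config=Config.auto(),
    default_project="flytesnacks",
    default_domain="development",
)

execution = remote.sync(remote.fetch_execution(name="f1234abcd"), sync_nodes=True)

# Intermediate output of the node that ran task 2 (node ids are shown in the UI).
df = execution.node_executions["n1"].outputs.get("o0", as_type=pd.DataFrame)

# Flyte tasks are plain Python callables outside a workflow, so the edited
# task runs locally on the fetched data for quick iteration / visualization.
print(task_3(data=df))
```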
s
Yes - basically data exploration with real-time data viz. A bit like Jupyter notebook really but with Flyte workflows. I’ll play around with this idea and report back
Or enso.org if you know that one - although this one doesn’t run in a browser so a little different for sure
k
@seunggs I am excited. This is doable. It won’t be as performant, since the data is probably larger. Remember, Flyte has no custom local storage - all data is stored in S3. You can create a separate storage layer and provide a driver for it, but that is not available today. On the other hand, performance is an illusion 🙂
s
Lol
k
haha jk
s
It’ll definitely depend on the dataset size but maybe there are ways to work around it (at least for some types of data) by asynchronously loading data or having an intermediate caching layer - will report back if I get anything reasonable working
Thanks for looking into this!
k
that would be amazing, thank you for sharing
👍 1