# ask-the-community
s
Hi, is there some documentation on running a part of a workflow? If there are 5 tasks in a workflow, for example, and I’d like to run the first 3 tasks to view the output of task #3, how could I do that? Also, where are the intermediate outputs for each task saved? I’m trying to visualize the intermediate task outputs, and I’m not sure it’ll be performant enough to be near real-time for large datasets.
y
so intermediate outputs are saved in the same place that workflow inputs/outputs are saved. there might be a bit of additional handling for workflows done by the control plane (@Katrina P can confirm) but for the most part it’s the same - offloaded data types like files and dataframes are stored in the bucket of your choice, and primitive inputs/outputs (along with pointers to those offloaded literals) are stored in Flyte’s s3 bucket as metadata.
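To make that concrete, here’s a minimal sketch (the task name and values are made up, not from this thread) of a task whose two outputs take the two different paths: the primitive goes into Flyte’s metadata store, while the dataframe is offloaded to the configured data bucket with only a pointer kept in metadata.

```python
# Minimal sketch, not from this thread: one primitive output and one offloaded output.
from typing import Tuple

import pandas as pd
from flytekit import task, workflow


@task
def summarize() -> Tuple[int, pd.DataFrame]:
    df = pd.DataFrame({"id": [1, 2, 3], "value": [0.1, 0.2, 0.3]})
    # The int is stored as a primitive literal in Flyte's metadata bucket;
    # the DataFrame is offloaded to your configured blob store, and only a
    # pointer to it lands in the metadata.
    return len(df), df


@workflow
def wf() -> Tuple[int, pd.DataFrame]:
    return summarize()
```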
there’s currently no way to tell the system to just stop in the middle of a workflow, unless it fails. are you looking for a breakpoint or something like that?
it should be easy enough however to register a new workflow with only the first three tasks.
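For instance (a hedged sketch; `task_1`..`task_3`, their signatures, and the `my_project.tasks` module are placeholders for your own tasks), you can compose a second workflow out of the same task functions and register just that:

```python
# Sketch: reuse the existing task functions in a shorter workflow that stops
# after the third task, so task 3's output becomes the workflow output.
import pandas as pd
from flytekit import workflow

from my_project.tasks import task_1, task_2, task_3  # hypothetical module


@workflow
def first_three_wf(raw_path: str) -> pd.DataFrame:
    a = task_1(path=raw_path)
    b = task_2(data=a)
    # Stop here: whatever task 3 returns is surfaced as the workflow output.
    return task_3(data=b)
```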
run results can be fetched later from admin or through the Python `FlyteRemote` object
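A rough sketch of that (the execution name, project, domain, and node id below are placeholders):

```python
# Sketch: fetch a past execution and read its outputs with FlyteRemote.
from flytekit.configuration import Config
from flytekit.remote import FlyteRemote

remote = FlyteRemote(
    config=Config.auto(),           # picks up your flytectl/flyte config
    default_project="flytesnacks",
    default_domain="development",
)

execution = remote.fetch_execution(name="f1234abcd")
execution = remote.sync(execution, sync_nodes=True)

# Workflow-level outputs of a finished run.
print(execution.outputs)

# Node-level (intermediate) outputs; node ids ("n0", "n1", ...) are visible
# in the Flyte console.
print(execution.node_executions["n2"].outputs)
```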
what do you mean by near-real-time?
s
OK so here’s what I’m trying to achieve: imagine trying to visualize the output data of task #3 out of a total of 5 tasks. Then I change the code in task #3, which requires rerunning the workflow up to task #3. If the intermediate data for tasks #1 and #2 is not in memory, it would take a while to rerun the whole workflow up to task #3 just to get that data visualized.
So my use case is mainly data exploration inside a workflow.
Changing the data processing and getting real-time visualization of that data.
In intermediate steps of the workflow
Hope that makes sense?
y
i see
when you say re-running, you mean remotely on a cluster? or locally on your laptop?
in both instances, we support data memoization right? so if tasks 1 and 2 haven’t changed, then they are not re-run.
however you cannot share caches between laptop and the flyte backend, not yet anyways
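As a sketch of what enabling that memoization looks like (task names and bodies here are made up): tasks 1 and 2 are marked cacheable, so when only task 3’s code changes, a re-run reuses their previous outputs instead of executing them again.

```python
# Sketch: enabling memoization on the upstream tasks. cache_version is bumped
# whenever a task's logic changes; otherwise a matching prior result is reused.
import pandas as pd
from flytekit import task, workflow


@task(cache=True, cache_version="1.0")
def load() -> pd.DataFrame:
    return pd.DataFrame({"x": range(1000)})


@task(cache=True, cache_version="1.0")
def clean(df: pd.DataFrame) -> pd.DataFrame:
    return df.dropna()


@task  # the task you're iterating on; left uncached while exploring
def explore(df: pd.DataFrame) -> pd.DataFrame:
    return df.describe()


@workflow
def wf() -> pd.DataFrame:
    return explore(df=clean(df=load()))
```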
s
Ah ok - I’m sorry if I’m being ignorant here, but I was just wondering where the data is cached if each task runs in a separate container (or am I getting this wrong)? If the task outputs are cached on disk, wouldn’t it be much slower to process than keeping the data in memory?
By the way, I’m talking about running the workflow remotely in a Kubernetes cluster
Thanks for answering my questions by the way! @Yee
y
oh no… they’re cached in s3
basically datacatalog is a pointer service, and propeller will check there first before running a cacheable task
and yeah ofc.
s
OK so we do need to bring the data over the network from s3 - hmm ok
y
we’re a little async sometimes, so just ping again if you’re in a hurry
and yes
there are performance enhancements we can do for sure
what’s the size of the data we’re talking about?
s
Millions of rows?
y
want to do a quick chat early next week?
can talk about the use-case a bit more specifically.
s
Let me do a bit more research here and get back to you
Still have to figure out how to build this out
I’ll play around with it a little and report back
y
sure of course.
👍 1
just want to understand the use case more.
k
@seunggs it seems you want to build a kind of debugger for Flyte, where you can get the data from any partial task locally and then re-run it with code modifications?
this should be completely doable, and we’ve wanted to do this for a while 🙂
UX is the most important thing
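A rough sketch of that flow, under some assumptions (the execution name, node id `n1`, output key `o0`, and the `my_project.tasks` module are all placeholders): fetch the upstream task’s output from a past remote run, then call the edited task function locally on it.

```python
# Sketch of the "partial re-run" flow: pull task 2's output from a previous
# remote execution, then run the edited task 3 locally on that data.
import pandas as pd
from flytekit.configuration import Config
from flytekit.remote import FlyteRemote

from my_project.tasks import task_3  # the task whose code you just changed

remote = FlyteRemote(
    config=Config.auto(),
    default_project="flytesnacks",
    default_domain="development",
)

execution = remote.sync(remote.fetch_execution(name="f1234abcd"), sync_nodes=True)

# Intermediate output of the node that ran task 2 (node ids are shown in the UI).
df = execution.node_executions["n1"].outputs.get("o0", as_type=pd.DataFrame)

# Flyte tasks are plain Python callables outside a workflow, so the edited
# task runs locally on the fetched data for quick iteration / visualization.
print(task_3(data=df))
```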
s
Yes - basically data exploration with real-time data viz. A bit like Jupyter notebook really but with Flyte workflows. I’ll play around with this idea and report back
Or enso.org if you know that one - although this one doesn’t run in a browser so a little different for sure
k
@seunggs I am excited. This is doable. It won’t be as performant, since the data is probably larger. Remember, Flyte has no custom local storage - all data is stored in S3. You can create a separate storage layer and provide a driver for it, but that is not available today. On the other hand, performance is an illusion 🙂
s
Lol
k
haha jk
s
It’ll definitely depend on the dataset size but maybe there are ways to work around it (at least for some types of data) by asynchronously loading data or having an intermediate caching layer - will report back if I get anything reasonable working
Thanks for looking into this!
k
that would be amazing, thank you for sharing
👍 1