Hey <@U0265RTUJ5B>, :thread: for the issue we're e...
# flytekit
r
Hey @Eduardo Apolinario (eapolinario), 🧵 for the issue we're experiencing related to the stalled task when returning a DataFrame
We have a library that uses
awswrangler
to read dataframes from our data warehouse (in s3), and previously had a basic task that we used to load datasets:
Copy code
@task
def load_from_warehouse(warehouse_name: str) -> pd.DataFrame:
  dataset = warehouse_library.dataset(warehouse_name)
  return dataset.read_dataframe()
Where
read_dataframe
calls awswrangler under the hood. This previously worked on flytekit 1.1.0, and when we upgraded to flytekit 1.2.0 the identical task took several orders of magnitude longer to complete (from 120s -> 90+ minutes) I'm curious if there was a regression introduced that led to a significant performance issue when saving dataframes to parquet. We tested just rolling back flytekit from 1.2.0 -> 1.1.0 and this resolved the issue for us. Another data point is that tasks that had
pd.DataFrame
either as an input or output were affected
k
cc @Yee I’ll take a look as well.
r
@Kevin Su happy to provide any extra details or test a patch out on the same workloads
k
This is interesting
e
@Rahul Mehta, can you list the packages installed (e.g. pip list) in each image? Also, which python version is this?
r
This is on python 3.9.13, will get you a list of the packages and the diff tmrw
e
@Rahul Mehta, we were able to repro this. Fix incoming.
r
Thank you! I still hadn't been able to get to pulling the packages, very curious what this was....
e
@Rahul Mehta, this boils down to this PR. Essentially we generate a flyte deck for pandas dataframes (via the
StructuredDataset
construct). If you don't care about the automatically-generated deck (and it looks like you don't) you can pass
disable_deck=True
to the
@task
that produces the dataframe
r
@Eduardo Apolinario (eapolinario) hmm interesting, I set
disable_deck=True
in our global decorator that we use (so we can centralize configs like this) and the issue seems to persist w/ the exact same memory usage pattern
e
@Rahul Mehta, I'm sorry I mislead you. Instead, just to unblock you, can you annotate your task with a
TopFrameRenderer
call like:
Copy code
from flytekit.deck import TopFrameRenderer

@task
def t() -> Annotated[pd.DataFrame, TopFrameRenderer(10)]:
we'll be working on a default renderer for dataframes that doesn't involve printing the entire dataframe ,but for now, you can force this renderer to only emit 10 rows (the top+bottom 5).
r
Will try tomorrow, thanks @Eduardo Apolinario (eapolinario). Seems like disable_deck still calls the code path?
e
Correct, the thing is that we only disable the decks after generating them. lolcry
k
Ohh no
r
I’d strongly prefer this be opt-in behavior vs opt out FWIW
We can enforce this with our own internal decorator though
k
I agree with the opt in
Cc @Eduardo Apolinario (eapolinario) / @Yee
r
Thanks for the suggestion @Eduardo Apolinario (eapolinario), this resolved the issue. We created a custom type alias for
pd.DataFrame
in a
types
module
Copy code
DataFrame = Annotated[pd.DataFrame, deck.TopFrameRenderer(10)]
It'd be nice to not require using this in place of pd.DataFrame (it's a flyte-specific detail our team needs to remember), so curious what the longer-term fix here is
k
There is definitely a fix
And I am in favor of disabling flytedecks by default
e
What if we turned the cost of producing default decks for pandas dataframes basically free? That's the approach I took in https://github.com/flyteorg/flytekit/pull/1238, where we just set reasonable defaults for both the max number of rows and columns.
r
@Eduardo Apolinario (eapolinario) that seems sane -- if you guys cut an RC w/ this change I can test it out on the same workload
k
@Eduardo Apolinario (eapolinario) there is still a cost to writing the deck and uploading it? Why not default to disable?
e
IMO there's value in getting a sense of the shape of the data at that cost. Keep in mind that the default generates an html of a few KB.
k
But it needs a disc and s3 write. This is a few 100ms at least?
e
yeah, this is true, but the cost to write the deck is amortized in any interesting workflow. Obviously, in the future, when we decide to target the absolute performance for workflows we can revisit this decision of generating decks for all basic Flyte types by default.
t
Also ran into this now with a 4m row dataframe wondering why my task takes so long and uses so much memory 😄 I want to second the point about opting in / making this 5-10 rows by default and optionally more.
Your suggested fix only works for the output decks being generated.
Next task that consumes that as input ends using the default renderer. I have a df with
[4978309 rows x 15 columns]
- calling to_html on it takes upwards of 10 minutes for me, at least that's when I ragequit the debugger.
I think it would be great if you can generally try to limit the pyflyte execution overhead as much as possible. With its typing system and promises, Flyte is already confusing to many - if it now adds noticeable overhead to everything or worse overhead that scales significantly with task inputs and outputs I can see many DS silently commenting out their @task decorators for local development - and moving back to other tools. That said I love flyte and your work on Decks. It would be sad if people don't get to experience the useful parts of that feature because they get stuck on things like this.
Just found that, will upgrade
156 Views