Hey @high-accountant-32689, 🧵 for the issue we're expe... (Flyte #flytekit)

elegant-australia-91422

10/13/2022, 2:05 AM

Hey @high-accountant-32689, 🧵 for the issue we're experiencing related to the stalled task when returning a DataFrame

elegant-australia-91422

10/13/2022, 2:08 AM

We have a library that uses awswrangler to read dataframes from our data warehouse (in S3), and previously had a basic task that we used to load datasets:

@task
def load_from_warehouse(warehouse_name: str) -> pd.DataFrame:
  dataset = warehouse_library.dataset(warehouse_name)
  return dataset.read_dataframe()

Where read_dataframe calls awswrangler under the hood. This previously worked on flytekit 1.1.0, and when we upgraded to flytekit 1.2.0 the identical task took far longer to complete (from 120s -> 90+ minutes). I'm curious if there was a regression introduced that led to a significant performance issue when saving dataframes to parquet. We tested just rolling back flytekit from 1.2.0 -> 1.1.0 and this resolved the issue for us. Another data point is that tasks that had pd.DataFrame as either an input or an output were affected.

elegant-australia-91422

10/13/2022, 2:09 AM

Also notable was the memory usage pattern we observed (linked in the other thread but copied here).

elegant-australia-91422

10/13/2022, 2:10 AM

For comparison, here's the resource utilization once we downgraded flytekit. kind of smells like a memory leak somewhere?

glamorous-carpet-83516

10/13/2022, 2:26 AM

cc @thankful-minister-83577 I’ll take a look as well.

🙏 1

elegant-australia-91422

10/13/2022, 2:38 AM

@glamorous-carpet-83516 happy to provide any extra details or test a patch out on the same workloads

freezing-airport-6809

10/13/2022, 3:00 AM

This is interesting

high-accountant-32689

10/13/2022, 4:50 AM

@elegant-australia-91422, can you list the packages installed (e.g. pip list) in each image? Also, which python version is this?

elegant-australia-91422

10/13/2022, 5:04 AM

This is on python 3.9.13, will get you a list of the packages and the diff tmrw

👍 1

high-accountant-32689

10/13/2022, 10:07 PM

@elegant-australia-91422, we were able to repro this. Fix incoming.

🚀 1

👍 1

elegant-australia-91422

10/13/2022, 10:14 PM

Thank you! I still hadn't been able to get to pulling the packages, very curious what this was....

high-accountant-32689

10/13/2022, 11:26 PM

@elegant-australia-91422, this boils down to this PR. Essentially we generate a flyte deck for pandas dataframes (via the StructuredDataset construct). If you don't care about the automatically-generated deck (and it looks like you don't) you can pass disable_deck=True to the @task that produces the dataframe

👍 1

elegant-australia-91422

10/14/2022, 2:06 PM

@high-accountant-32689 hmm interesting, I set disable_deck=True in our global decorator that we use (so we can centralize configs like this) and the issue seems to persist w/ the exact same memory usage pattern

high-accountant-32689

10/14/2022, 11:56 PM

@elegant-australia-91422, I'm sorry I misled you. Instead, just to unblock you, can you annotate your task with a TopFrameRenderer call like:

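from typing import Annotated
import pandas as pd
from flytekit import task
from flytekit.deck import TopFrameRenderer

@task
def t() -> Annotated[pd.DataFrame, TopFrameRenderer(10)]:
    ...  # task body elided in the original message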

high-accountant-32689

10/14/2022, 11:58 PM

We'll be working on a default renderer for dataframes that doesn't involve printing the entire dataframe, but for now, you can force this renderer to only emit 10 rows (the top+bottom 5).

elegant-australia-91422

10/15/2022, 12:00 AM

Will try tomorrow, thanks @high-accountant-32689. Seems like disable_deck still calls the code path?

high-accountant-32689

10/15/2022, 12:05 AM

Correct, the thing is that we only disable the decks after generating them. lolcry

😅 1
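
The ordering problem described above (the deck is rendered first and only then discarded) can be illustrated with a small stdlib-only sketch. This is not flytekit's code: `publish_deck_late_check`, `publish_deck_early_check`, and `expensive_render` are hypothetical names chosen for illustration, and the only behaviour assumed is what the message above states.

```python
from typing import Callable, List, Optional


def publish_deck_late_check(render: Callable[[], str], disable_deck: bool) -> Optional[str]:
    """Render first, then honour the flag (the behaviour described in the thread)."""
    html = render()  # the expensive step always runs
    if disable_deck:
        return None  # output is discarded, but the cost was already paid
    return html


def publish_deck_early_check(render: Callable[[], str], disable_deck: bool) -> Optional[str]:
    """Check the flag first so a disabled deck costs nothing."""
    if disable_deck:
        return None
    return render()


calls: List[str] = []


def expensive_render() -> str:
    # Hypothetical stand-in for rendering a full dataframe to HTML.
    calls.append("rendered")
    return "<table>...</table>"


publish_deck_late_check(expensive_render, disable_deck=True)
late_calls = len(calls)  # the renderer ran even though the deck was disabled

calls.clear()
publish_deck_early_check(expensive_render, disable_deck=True)
early_calls = len(calls)  # the renderer was skipped entirely
```

With the late check, setting the flag changes what is published but not how long the task takes, which matches the unchanged memory pattern reported earlier in the thread.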

freezing-airport-6809

10/15/2022, 12:32 AM

Ohh no

elegant-australia-91422

10/15/2022, 1:57 AM

I’d strongly prefer this be opt-in behavior vs opt out FWIW

elegant-australia-91422

10/15/2022, 1:57 AM

We can enforce this with our own internal decorator though

freezing-airport-6809

10/15/2022, 2:40 PM

I agree with the opt in

freezing-airport-6809

10/15/2022, 2:41 PM

Cc @high-accountant-32689 / @thankful-minister-83577

elegant-australia-91422

10/15/2022, 9:42 PM

Thanks for the suggestion @high-accountant-32689, this resolved the issue. We created a custom type alias for pd.DataFrame in a types module

DataFrame = Annotated[pd.DataFrame, deck.TopFrameRenderer(10)]

It'd be nice to not require using this in place of pd.DataFrame (it's a flyte-specific detail our team needs to remember), so curious what the longer-term fix here is
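
For readers unfamiliar with how such an alias carries the renderer along, here is a stdlib-only sketch of the mechanism. `FakeFrame` and `TopRowsRenderer` are hypothetical stand-ins for `pd.DataFrame` and `TopFrameRenderer`, and the inspection step is only an assumption about how a framework could read the metadata, not a description of flytekit internals.

```python
from typing import Annotated, get_args, get_type_hints


class FakeFrame:
    """Hypothetical stand-in for pd.DataFrame."""


class TopRowsRenderer:
    """Hypothetical stand-in for a renderer that caps the number of rows."""

    def __init__(self, max_rows: int = 10) -> None:
        self.max_rows = max_rows


# What a shared "types" module would export: one alias, defined once.
DataFrame = Annotated[FakeFrame, TopRowsRenderer(10)]


def load_from_warehouse(warehouse_name: str) -> DataFrame:
    # Task authors use the alias and never mention the renderer.
    return FakeFrame()


# A framework can recover both the base type and the attached metadata.
hints = get_type_hints(load_from_warehouse, include_extras=True)
base_type, renderer = get_args(hints["return"])
```

The alias keeps the renderer choice in one place, at the cost noted above: every task author has to remember to import it instead of the plain dataframe type.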

freezing-airport-6809

10/15/2022, 11:10 PM

There is definitely a fix

freezing-airport-6809

10/15/2022, 11:10 PM

And I am in favor of disabling flytedecks by default

high-accountant-32689

10/17/2022, 6:46 PM

What if we made the cost of producing default decks for pandas dataframes basically free? That's the approach I took in https://github.com/flyteorg/flytekit/pull/1238, where we just set reasonable defaults for both the max number of rows and columns.

👍 2
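
The idea of capping rows and columns before rendering can be sketched without pandas or flytekit. This is an illustration only, not the code in the linked PR: `render_preview` and its default limits are hypothetical.

```python
from html import escape
from typing import Any, List, Sequence


def render_preview(
    columns: Sequence[str],
    rows: Sequence[Sequence[Any]],
    max_rows: int = 10,
    max_cols: int = 20,
) -> str:
    """Render at most max_rows rows (split between top and bottom) and max_cols columns."""
    col_idx = list(range(min(len(columns), max_cols)))
    if len(rows) > max_rows:
        head = max_rows // 2
        tail = max_rows - head
        shown: List[Sequence[Any]] = list(rows[:head]) + list(rows[len(rows) - tail:])
    else:
        shown = list(rows)
    header = "".join(f"<th>{escape(str(columns[i]))}</th>" for i in col_idx)
    body = "".join(
        "<tr>" + "".join(f"<td>{escape(str(row[i]))}</td>" for i in col_idx) + "</tr>"
        for row in shown
    )
    # Keep the full shape in the output so the reader still sees the data's size.
    shape = f"<p>{len(rows)} rows x {len(columns)} columns</p>"
    return f"<table><tr>{header}</tr>{body}</table>{shape}"


columns = [f"c{i}" for i in range(15)]
rows = [[r * 15 + c for c in range(15)] for r in range(100_000)]
html = render_preview(columns, rows, max_rows=10, max_cols=5)
```

Because only the selected rows and columns are ever formatted, the rendering work depends on the limits rather than on the size of the data.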

elegant-australia-91422

10/18/2022, 12:14 AM

@high-accountant-32689 that seems sane -- if you guys cut an RC w/ this change I can test it out on the same workload

freezing-airport-6809

10/18/2022, 12:28 AM

@high-accountant-32689 there is still a cost to writing the deck and uploading it? Why not default to disable?

high-accountant-32689

10/18/2022, 12:31 AM

IMO there's value in getting a sense of the shape of the data at that cost. Keep in mind that the default generates an html of a few KB.

freezing-airport-6809

10/18/2022, 1:20 PM

But it needs a disc and s3 write. This is a few 100ms at least?

high-accountant-32689

10/18/2022, 11:51 PM

yeah, this is true, but the cost to write the deck is amortized in any interesting workflow. Obviously, in the future, when we decide to target the absolute performance for workflows we can revisit this decision of generating decks for all basic Flyte types by default.

worried-restaurant-93221

11/29/2022, 2:39 PM

Also ran into this now with a 4m row dataframe, wondering why my task takes so long and uses so much memory 😄 I want to second the point about opting in / making this 5-10 rows by default and optionally more.

worried-restaurant-93221

11/29/2022, 2:51 PM

Your suggested fix only works for the output decks being generated.

worried-restaurant-93221

11/29/2022, 2:54 PM

Next task that consumes that as input ends up using the default renderer. I have a df with [4978309 rows x 15 columns] - calling to_html on it takes upwards of 10 minutes for me, at least that's when I ragequit the debugger.

worried-restaurant-93221

11/29/2022, 3:03 PM

I think it would be great if you can generally try to limit the pyflyte execution overhead as much as possible. With its typing system and promises, Flyte is already confusing to many. If it now adds noticeable overhead to everything or, worse, overhead that scales significantly with task inputs and outputs, I can see many DS silently commenting out their @task decorators for local development and moving back to other tools. That said, I love flyte and your work on Decks. It would be sad if people don't get to experience the useful parts of that feature because they get stuck on things like this.

worried-restaurant-93221

11/29/2022, 3:17 PM

https://github.com/flyteorg/flytekit/pull/1251/files

worried-restaurant-93221

11/29/2022, 3:18 PM

Just found that, will upgrade
