elegant-australia-91422
10/13/2022, 2:05 AM
elegant-australia-91422
10/13/2022, 2:08 AM
We use awswrangler to read dataframes from our data warehouse (in S3), and previously had a basic task that we used to load datasets:
@task
def load_from_warehouse(warehouse_name: str) -> pd.DataFrame:
    dataset = warehouse_library.dataset(warehouse_name)
    return dataset.read_dataframe()
Where read_dataframe calls awswrangler under the hood. This previously worked on flytekit 1.1.0, and when we upgraded to flytekit 1.2.0 the identical task took more than an order of magnitude longer to complete (from 120s to 90+ minutes, roughly 45x).
I'm curious whether 1.2.0 introduced a regression that significantly slows down saving dataframes to parquet. Rolling back only flytekit, from 1.2.0 to 1.1.0, resolved the issue for us.
Another data point is that tasks that had pd.DataFrame either as an input or output were affected.
elegant-australia-91422
10/13/2022, 2:09 AM
elegant-australia-91422
10/13/2022, 2:10 AM
glamorous-carpet-83516
10/13/2022, 2:26 AM
elegant-australia-91422
10/13/2022, 2:38 AM
freezing-airport-6809
high-accountant-32689
10/13/2022, 4:50 AM
elegant-australia-91422
10/13/2022, 5:04 AM
high-accountant-32689
10/13/2022, 10:07 PM
elegant-australia-91422
10/13/2022, 10:14 PM
high-accountant-32689
10/13/2022, 11:26 PM
StructuredDataset construct). If you don't care about the automatically generated deck (and it looks like you don't), you can pass disable_deck=True to the @task that produces the dataframe.
elegant-australia-91422
10/14/2022, 2:06 PM
We set disable_deck=True in the global decorator that we use (so we can centralize configs like this), and the issue seems to persist with the exact same memory usage pattern.
high-accountant-32689
10/14/2022, 11:56 PM
TopFrameRenderer call like:
from flytekit.deck import TopFrameRenderer

@task
def t() -> Annotated[pd.DataFrame, TopFrameRenderer(10)]:
    ...
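The argument passed to TopFrameRenderer caps how many rows get rendered for the deck. As a rough illustration of why that helps, here is a minimal, stdlib-only sketch of rendering just the first few rows of a large table. It is not flytekit's implementation; `rows_to_html` and the sample data are hypothetical names made up for this example.

```python
from html import escape
from itertools import islice
from typing import Iterable, Optional, Sequence


def rows_to_html(
    columns: Sequence[str],
    rows: Iterable[Sequence[object]],
    max_rows: Optional[int] = None,
) -> str:
    """Render rows as an HTML table, optionally stopping after max_rows."""
    # islice(rows, None) consumes everything; with a number it stops early,
    # so the rendering cost is bounded by max_rows, not by the dataset size.
    limited = islice(rows, max_rows)
    header = "".join(f"<th>{escape(c)}</th>" for c in columns)
    body = "".join(
        "<tr>" + "".join(f"<td>{escape(str(v))}</td>" for v in row) + "</tr>"
        for row in limited
    )
    return f"<table><thead><tr>{header}</tr></thead><tbody>{body}</tbody></table>"


# Hypothetical data: a lazily generated table with a million rows.
big = ((i, i * 2) for i in range(1_000_000))
preview = rows_to_html(["a", "b"], big, max_rows=10)
print(preview.count("<td>"))  # number of cells actually rendered
```

With pandas, the closest analogue would be `df.head(10).to_html()`, which converts only the first ten rows instead of the whole frame.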
high-accountant-32689
10/14/2022, 11:58 PM
elegant-australia-91422
10/15/2022, 12:00 AM
high-accountant-32689
10/15/2022, 12:05 AM
freezing-airport-6809
elegant-australia-91422
10/15/2022, 1:57 AM
elegant-australia-91422
10/15/2022, 1:57 AM
freezing-airport-6809
freezing-airport-6809
elegant-australia-91422
10/15/2022, 9:42 PM
pd.DataFrame in a types module:
DataFrame = Annotated[pd.DataFrame, deck.TopFrameRenderer(10)]
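A minimal sketch of why an alias like this can stand in for the plain type: `Annotated` attaches metadata that a framework can read from the signature, while ordinary type inspection still sees the underlying class. To keep it stdlib-only, `Frame` and `RowLimit` are hypothetical stand-ins for pd.DataFrame and the renderer, not real flytekit or pandas names.

```python
from dataclasses import dataclass
from typing import Annotated, get_args, get_type_hints


@dataclass(frozen=True)
class RowLimit:
    """Hypothetical stand-in for a renderer such as deck.TopFrameRenderer."""

    max_rows: int


class Frame:
    """Hypothetical stand-in for pd.DataFrame."""


# The alias lives in one module; task code imports it instead of the raw type.
DataFrame = Annotated[Frame, RowLimit(10)]


def load() -> DataFrame:
    return Frame()


# A framework can recover the attached metadata from the signature...
hints = get_type_hints(load, include_extras=True)
base, marker = get_args(hints["return"])
print(base is Frame, marker)

# ...while plain type inspection still resolves to the underlying class.
print(get_type_hints(load)["return"] is Frame)
```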
It'd be nice not to have to use this in place of pd.DataFrame (it's a Flyte-specific detail our team needs to remember), so I'm curious what the longer-term fix here is.
freezing-airport-6809
freezing-airport-6809
high-accountant-32689
10/17/2022, 6:46 PM
elegant-australia-91422
10/18/2022, 12:14 AM
freezing-airport-6809
high-accountant-32689
10/18/2022, 12:31 AM
freezing-airport-6809
high-accountant-32689
10/18/2022, 11:51 PM
worried-restaurant-93221
11/29/2022, 2:39 PM
worried-restaurant-93221
11/29/2022, 2:51 PM
worried-restaurant-93221
11/29/2022, 2:54 PM
[4978309 rows x 15 columns] - calling to_html on it takes upwards of 10 minutes for me; at least, that's when I rage-quit the debugger.
worried-restaurant-93221
11/29/2022, 3:03 PM
worried-restaurant-93221
11/29/2022, 3:17 PM
worried-restaurant-93221
11/29/2022, 3:18 PM