Rahul Mehta
10/13/2022, 2:05 AM
We use awswrangler to read dataframes from our data warehouse (in S3), and previously had a basic task that we used to load datasets:
@task
def load_from_warehouse(warehouse_name: str) -> pd.DataFrame:
    dataset = warehouse_library.dataset(warehouse_name)
    return dataset.read_dataframe()
Where read_dataframe calls awswrangler under the hood. This worked on flytekit 1.1.0, but when we upgraded to flytekit 1.2.0 the identical task took far longer to complete (from 120s to 90+ minutes, at least 45x slower).
I'm curious if there was a regression introduced that led to a significant performance issue when saving dataframes to parquet. We tested just rolling back flytekit from 1.2.0 -> 1.1.0 and this resolved the issue for us.
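One way to narrow down a slowdown like this is to time the load and the serialization separately, since a task that returns a dataframe also pays for whatever the framework does with the return value. A minimal sketch using only the standard library; `timed`, `load_rows` and `serialize` are hypothetical stand-ins, not flytekit or awswrangler APIs:

```python
import json
import time
from contextlib import contextmanager

timings = {}  # stage name -> elapsed seconds


@contextmanager
def timed(stage):
    """Record the wall-clock duration of the enclosed block under `stage`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = time.perf_counter() - start


def load_rows(n):
    # Stand-in for the warehouse read (hypothetical).
    return [{"id": i, "value": i * 0.5} for i in range(n)]


def serialize(rows):
    # Stand-in for whatever happens to the task's return value (hypothetical).
    return json.dumps(rows)


with timed("load"):
    rows = load_rows(1000)
with timed("serialize"):
    payload = serialize(rows)

for stage, seconds in timings.items():
    print(f"{stage}: {seconds:.4f}s")
```

If the load stage stays near its old duration while the other stage grows, the regression is more likely in how the returned dataframe is handled than in the read itself.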
Another data point is that tasks that had pd.DataFrame either as an input or output were affected.
Kevin Su
10/13/2022, 2:26 AM
Rahul Mehta
10/13/2022, 2:38 AM
Ketan (kumare3)
Eduardo Apolinario (eapolinario)
10/13/2022, 4:50 AM
Rahul Mehta
10/13/2022, 5:04 AM
Eduardo Apolinario (eapolinario)
10/13/2022, 10:07 PM
Rahul Mehta
10/13/2022, 10:14 PM
Eduardo Apolinario (eapolinario)
10/13/2022, 11:26 PM
StructuredDataset construct). If you don't care about the automatically generated deck (and it looks like you don't), you can pass disable_deck=True to the @task that produces the dataframe.
Rahul Mehta
10/14/2022, 2:06 PM
We set disable_deck=True in our global decorator that we use (so we can centralize configs like this) and the issue seems to persist with the exact same memory usage pattern.
Eduardo Apolinario (eapolinario)
10/14/2022, 11:56 PM
TopFrameRenderer call like:
from flytekit.deck import TopFrameRenderer

@task
def t() -> Annotated[pd.DataFrame, TopFrameRenderer(10)]:
    ...
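For readers unfamiliar with `Annotated`: it attaches extra metadata to a type without changing the type itself, which is how a renderer can be hung on a `pd.DataFrame` return annotation. A minimal sketch of the mechanism using only the standard library (Python 3.9+); `RowLimit` and `Frame` are hypothetical stand-ins for the renderer and the dataframe, and this does not claim to be how flytekit itself reads the annotation:

```python
from typing import Annotated, get_args, get_origin, get_type_hints


class Frame:
    """Hypothetical stand-in for pd.DataFrame."""


class RowLimit:
    """Hypothetical stand-in for a renderer configured with a max row count."""

    def __init__(self, max_rows):
        self.max_rows = max_rows


def t() -> Annotated[Frame, RowLimit(10)]:
    return Frame()


# include_extras=True keeps the Annotated wrapper instead of stripping it.
hint = get_type_hints(t, include_extras=True)["return"]
base_type, *metadata = get_args(hint)

print(get_origin(hint) is Annotated)  # True
print(base_type is Frame)             # True
print(metadata[0].max_rows)           # 10
```

The function still returns a plain `Frame`; only code that inspects the annotation sees the extra configuration.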
Rahul Mehta
10/15/2022, 12:00 AM
Eduardo Apolinario (eapolinario)
10/15/2022, 12:05 AM
Ketan (kumare3)
Rahul Mehta
10/15/2022, 1:57 AM
Ketan (kumare3)
Rahul Mehta
10/15/2022, 9:42 PM
pd.DataFrame in a types module:
DataFrame = Annotated[pd.DataFrame, deck.TopFrameRenderer(10)]
It'd be nice not to require using this in place of pd.DataFrame (it's a Flyte-specific detail our team needs to remember), so I'm curious what the longer-term fix here is.
Ketan (kumare3)
Eduardo Apolinario (eapolinario)
10/17/2022, 6:46 PM
Rahul Mehta
10/18/2022, 12:14 AM
Ketan (kumare3)
Eduardo Apolinario (eapolinario)
10/18/2022, 12:31 AM
Ketan (kumare3)
Eduardo Apolinario (eapolinario)
10/18/2022, 11:51 PM
Tim Bauer
11/29/2022, 2:39 PM
[4978309 rows x 15 columns] - calling to_html on it takes upwards of 10 minutes for me; at least, that's when I ragequit the debugger.
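The cost described here comes from rendering every row to HTML; limiting the rows before rendering keeps the work proportional to the preview size rather than the dataset size. A rough sketch of that idea with the standard library only; `render_html` is a hypothetical helper, not pandas' `to_html`, and with pandas the analogous move would be something like `df.head(10).to_html()`:

```python
from html import escape
from itertools import islice


def render_html(rows, columns, max_rows=None):
    """Render an iterable of row tuples as an HTML table.

    If max_rows is given, only the first max_rows rows are consumed,
    so the cost does not grow with the size of the full dataset.
    """
    if max_rows is not None:
        rows = islice(rows, max_rows)
    header = "".join(f"<th>{escape(str(c))}</th>" for c in columns)
    body = "".join(
        "<tr>" + "".join(f"<td>{escape(str(v))}</td>" for v in row) + "</tr>"
        for row in rows
    )
    return f"<table><tr>{header}</tr>{body}</table>"


# A generator standing in for a multi-million-row dataset (hypothetical data).
big = ((i, i * i) for i in range(4_978_309))
preview = render_html(big, ["n", "n_squared"], max_rows=10)
print(preview.count("<tr>"))  # 11: one header row plus ten data rows
```

Because the generator is only advanced ten times, the remaining rows are never materialized or formatted.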