Flyte #flyte-support

Hey, everyone. We're dealing with a weird bug here...

sparse-window-1536

09/19/2022, 8:33 PM

Hey, everyone. We're dealing with a weird bug here and we have no idea how to fix it. Basically, we have a task that's not finishing. We run a workflow with 8 tasks, and at the very last one it hangs. All of its code is being executed (a lot of `print`s from the very start all the way to the `return` statement confirm that), but it hangs at the end and never actually finishes. We were able to reproduce it on our remote server and locally. On the remote server, none of the prints (or logs) are being shown on Stackdriver. What could be happening?
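One way to see where a seemingly finished task is actually stuck is to have Python dump its own stack after a timeout. The sketch below uses the standard-library `faulthandler` module; the helper name `run_with_hang_dump` and its parameters are hypothetical and not part of flytekit, and it assumes the hang happens inside the same Python process that runs the function.

```python
import faulthandler
import sys
import time


def run_with_hang_dump(func, *args, timeout_s=30.0, file=None, **kwargs):
    """Run func; if it is still running after timeout_s seconds, dump the
    stack of every thread to `file` (defaults to stderr) without stopping it.

    Hypothetical helper for debugging hangs, not a flytekit API.
    """
    target = sys.stderr if file is None else file
    # The watchdog writes the traceback even while the main thread is busy.
    faulthandler.dump_traceback_later(timeout_s, repeat=False, file=target)
    try:
        return func(*args, **kwargs)
    finally:
        # Always disarm the watchdog once the function returns or raises.
        faulthandler.cancel_dump_traceback_later()


def slow_step():
    # Stand-in for the code suspected of hanging.
    time.sleep(0.3)
    return "done"
```

Wrapping the suspect call, for example `run_with_hang_dump(slow_step, timeout_s=0.1)`, prints the active stack frames once the timeout passes, which would point at the line that is still executing.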

sparse-window-1536

09/19/2022, 8:34 PM

We're using `flytekit~=1.0.0`, so I believe we're using the latest version, if compatible release is to be trusted
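For reference, the compatible release operator in PEP 440 treats `~=1.0.0` as `>=1.0.0, ==1.0.*`, so it only picks up newer 1.0.x patch releases. The following is a minimal sketch of that rule for plain numeric versions; the function names are made up for illustration, and real tooling such as pip handles many more cases (pre-releases, epochs, and so on).

```python
def parse_version(text):
    """Split a plain numeric version such as '1.0.0' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def matches_compatible_release(spec_version, candidate):
    """Return True if `candidate` satisfies `~=spec_version`.

    Simplified reading of PEP 440: ~=X.Y.Z means >=X.Y.Z and ==X.Y.*
    Only plain dotted integers are supported in this sketch.
    """
    spec = parse_version(spec_version)
    cand = parse_version(candidate)
    if len(spec) < 2:
        raise ValueError("~= needs at least two release segments")
    # Pad the shorter tuple with zeros so 1.0 and 1.0.0 compare as equal.
    width = max(len(spec), len(cand))
    spec_padded = spec + (0,) * (width - len(spec))
    cand_padded = cand + (0,) * (width - len(cand))
    prefix = spec[:-1]  # every segment except the last must match exactly
    return cand_padded >= spec_padded and cand_padded[: len(prefix)] == prefix


print(matches_compatible_release("1.0.0", "1.0.5"))
print(matches_compatible_release("1.0.0", "1.2.0"))
```

Under this rule a pin of `~=1.0.0` would not select a 1.1.x or 1.2.x release, whereas `~=1.0` would.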

sparse-window-1536

09/19/2022, 8:36 PM

Prints logged when using `pyflyte run`. This `7 None` happens right before the return statement (`None` is the value of that statement)

freezing-airport-6809

09/19/2022, 9:06 PM

so IIUC, you are returning from your code and you are saying that the return completes, but hangs after this?

freezing-airport-6809

09/19/2022, 9:07 PM

cc @high-accountant-32689 / @thankful-minister-83577 Could this be upload of the literal?

sparse-window-1536

09/19/2022, 9:08 PM

Yeah, I'm not sure the return completes since I can't execute anything after it 😅 but the code right after it does.

freezing-airport-6809

09/19/2022, 9:09 PM

and you said you can reproduce this locally?

sparse-window-1536

09/19/2022, 9:09 PM

Weirdly enough, this task is the only one that returns `None`, and that does not send its return value to another variable in the workflow spec. The workflow also returns `None`

high-accountant-32689

09/19/2022, 9:09 PM

is there anything unusual about the task? Can you share its overall structure?

sparse-window-1536

09/19/2022, 9:09 PM

yeah, this screenshot I sent is from `pyflyte run`

thankful-minister-83577

09/19/2022, 9:10 PM

sounds like a bug? can you change it to `return 5` and make the signature an int?

thankful-minister-83577

09/19/2022, 9:10 PM

what’s the return type now?

sparse-window-1536

09/19/2022, 9:16 PM

This is the overall structure of the task:

@extended_task(integrations=['gcloud'], requests=Resources(mem='4Gi'))
def update_bq_table(
    amnt_dataframe: pd.DataFrame,
    gcs_config_path: str
) -> None:
    config_dict = read_file(gcs_config_path)
    update_gbq_table(    # Function that calls pandas_gbq.to_gbq()
        amnt_dataframe,
        config_dict['table_schema'],
        config_dict['table_destination']
    )

`@extended_task` is a special decorator we use that does some pre- and post-processing on tasks. Other tasks with it are running fine; the `7 None` on my screenshot above is being printed in the wrapper, after an `output = task_func(*args, **kwargs)` and before a `return output`
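The wrapper described here can be pictured roughly as follows. This is only a guess at its shape: the real `@extended_task` is not shown in the thread, so the decorator name `extended_task_sketch`, the stub task, and the placement of the pre- and post-processing steps are all assumptions.

```python
import functools


def extended_task_sketch(task_func):
    """Hypothetical stand-in for the team's @extended_task decorator."""

    @functools.wraps(task_func)
    def wrapper(*args, **kwargs):
        # Pre-processing would run here (assumed; not shown in the thread).
        output = task_func(*args, **kwargs)
        # Debug print placed between the call and the return, as described.
        print(7, output)
        # Post-processing would run here (assumed; not shown in the thread).
        return output

    return wrapper


@extended_task_sketch
def update_table_stub(rows):
    """Stub task body that, like update_bq_table, returns None."""
    return None


result = update_table_stub([1, 2, 3])
```

Seeing the print fire confirms that the task body has returned to the wrapper, so anything that hangs afterwards runs outside the user's function.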

thankful-minister-83577

09/19/2022, 9:17 PM

can you try changing it to int?

thankful-minister-83577

09/19/2022, 9:17 PM

just to see if that fixes it?

sparse-window-1536

09/19/2022, 9:18 PM

The workflow looks like this:

@workflow
def main_workflow(
        hotel_amnt_sql_path: str,
        config_path: str,
        config_pre_process_path: str,
        model_config_path: str
) -> None:
    preview_amnt = ...

    # Some other tasks, all with <output> = <function call>

    update_bq_table(
        amnt_dataframe = hotel_topics,
        gcs_config_path = config_path
    )

sparse-window-1536

09/19/2022, 9:18 PM

yeah I'll try it

abundant-night-96152

09/20/2022, 4:05 PM

Hey! I work with @sparse-window-1536. We changed it to int, but it didn't work 😕

thankful-minister-83577

09/20/2022, 5:03 PM

@sparse-window-1536 @abundant-night-96152 hop on call?

sparse-window-1536

09/20/2022, 5:09 PM

yeah, send us the link

thankful-minister-83577

09/20/2022, 5:12 PM

https://meet.google.com/acf-bgqq-hsz

thankful-minister-83577

09/20/2022, 7:43 PM

can you try something for me?

thankful-minister-83577

09/20/2022, 7:43 PM

in the body of the task that’s hanging

thankful-minister-83577

09/20/2022, 7:43 PM

can you delete everything that’s in that function and just replace it with

thankful-minister-83577

09/20/2022, 7:43 PM

print(amnt_dataframe.describe().to_html())

and keep all the `print(7)`

abundant-night-96152

09/20/2022, 8:03 PM

yeah I'll try it

abundant-night-96152

09/20/2022, 8:11 PM

it is hanging on this `print(amnt_dataframe.describe().to_html())`

thankful-minister-83577

09/20/2022, 8:20 PM

ah nice

thankful-minister-83577

09/20/2022, 8:21 PM

can you remove the `.to_html()`

thankful-minister-83577

09/20/2022, 8:21 PM

and see if it still hangs?

abundant-night-96152

09/20/2022, 8:42 PM

still hanging

high-accountant-32689

09/20/2022, 8:45 PM

how big is this dataframe? Does it hang if you try with only a few rows?

thankful-minister-83577

09/20/2022, 8:45 PM

so describe is failing. yeah how big is this

abundant-night-96152

09/20/2022, 8:46 PM

237857 rows x 2 columns

thankful-minister-83577

09/20/2022, 8:48 PM

that’s not that big…

abundant-night-96152

09/20/2022, 8:48 PM

5.7 MB

thankful-minister-83577

09/20/2022, 8:48 PM

are you able to run describe outside of flyte

thankful-minister-83577

09/20/2022, 8:49 PM

just like in jupyter or in ipython or something, create the dataframe manually and try describe on it.

abundant-night-96152

09/20/2022, 8:53 PM

One of our columns is "topics", and it holds a list in each row of the dataframe. Could that be the problem?

thankful-minister-83577

09/20/2022, 8:54 PM

the whole dataframe is 5.7 MBs?

thankful-minister-83577

09/20/2022, 8:56 PM

i don’t think any structure that’s that small should be a problem for pandas

thankful-minister-83577

09/20/2022, 8:56 PM

in any case

thankful-minister-83577

09/20/2022, 8:57 PM

can you maybe continue to investigate on the side? and in the meantime, add this to the top of your file

import pandas as pd
from flytekit import task
from flytekit.deck.renderer import TopFrameRenderer
from typing_extensions import Annotated

and then make the task like

@task
def mytask() -> Annotated[pd.DataFrame, TopFrameRenderer(10)]: ...

that should make it so that the renderer used just grabs the first 10 rows

thankful-minister-83577

09/20/2022, 8:57 PM

will make it skip the describe call

thankful-minister-83577

09/20/2022, 8:59 PM

but this is something we should continue to investigate. do you think you can send us a parquet file with the smallest set of data that can repro this?

abundant-night-96152

09/20/2022, 11:15 PM

message has been deleted

abundant-night-96152

09/22/2022, 7:32 PM

@thankful-minister-83577 Do you have any update? I am still fighting against this hanging problem 😕

thankful-minister-83577

09/22/2022, 8:07 PM

let me play around with this tonight.

thankful-minister-83577

09/22/2022, 8:07 PM

but did the workaround not work?

high-accountant-32689

09/23/2022, 3:02 AM

just to confirm, I can see the python process get stuck when running this:

❯ ipython
Python 3.8.13 (default, Mar 28 2022, 11:38:47)
Type 'copyright', 'credits' or 'license' for more information
IPython 8.5.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import pandas as pd

In [2]: pd.read_parquet("/home/eduardo/Downloads/amnt_dataframe.parquet.gzip")
Out[2]:
                      generic_sku                                           topics
0       HT-0008-0-0-0-0-0-0-0-0-0                  [ST5, ST6, ST2, ST13, ST4, ST3]
1       HT-000M-0-0-0-0-0-0-0-0-0             [ST5, ST6, ST2, ST7, ST13, ST4, ST3]
2       HT-000W-0-0-0-0-0-0-0-0-0            [ST5, ST6, ST12, ST10, ST7, ST4, ST1]
3       HT-000X-0-0-0-0-0-0-0-0-0                       [ST5, ST6, ST13, ST4, ST1]
4       HT-000Z-0-0-0-0-0-0-0-0-0                 [ST5, ST10, ST13, ST4, ST1, ST3]
...                           ...                                              ...
237852  HT-ZZY9-0-0-0-0-0-0-0-0-0                                       [ST1, ST4]
237853  HT-ZZYC-0-0-0-0-0-0-0-0-0  [ST5, ST6, ST2, ST12, ST7, ST13, ST4, ST1, ST3]
237854  HT-ZZYZ-0-0-0-0-0-0-0-0-0                      [ST5, ST10, ST13, ST4, ST1]
237855  HT-ZZZ2-0-0-0-0-0-0-0-0-0                  [ST5, ST12, ST7, ST4, ST1, ST3]
237856  HT-ZZZJ-0-0-0-0-0-0-0-0-0                             [ST4, ST5, ST6, ST2]

[237857 rows x 2 columns]

In [3]: df = pd.read_parquet("/home/eduardo/Downloads/amnt_dataframe.parquet.gzip")

In [4]: df.describe()

sparse-window-1536

09/23/2022, 2:00 PM

Yeah, me too. It takes a few seconds for my PC to describe 1k rows, which is why it's taking hours to describe the entire dataset.
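As a rough sanity check on that estimate, the runtime for the full 237857 rows can be extrapolated from the 1k-row timing under different scaling assumptions. The 3.0 seconds used below is an assumed stand-in for "a few seconds", and the scaling exponents are hypothetical; the thread does not say how `describe` actually scales on this data.

```python
def extrapolate_runtime(sample_rows, sample_seconds, total_rows, exponent):
    """Estimate runtime for total_rows assuming time grows as rows**exponent."""
    return sample_seconds * (total_rows / sample_rows) ** exponent


TOTAL_ROWS = 237_857            # row count reported in the thread
SAMPLE_ROWS = 1_000             # "1k rows"
ASSUMED_SAMPLE_SECONDS = 3.0    # assumption standing in for "a few seconds"

# Exponent 1.0 models linear scaling, 2.0 models quadratic scaling.
linear_s = extrapolate_runtime(SAMPLE_ROWS, ASSUMED_SAMPLE_SECONDS, TOTAL_ROWS, 1.0)
quadratic_s = extrapolate_runtime(SAMPLE_ROWS, ASSUMED_SAMPLE_SECONDS, TOTAL_ROWS, 2.0)

print(f"linear scaling:    {linear_s / 60:.1f} minutes")
print(f"quadratic scaling: {quadratic_s / 3600:.1f} hours")
```

With these assumed inputs, linear scaling lands in the range of minutes and quadratic scaling in the range of many hours, so a multi-hour run would suggest the cost grows faster than linearly with the number of rows.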

sparse-window-1536

09/23/2022, 2:01 PM

The solution that @thankful-minister-83577 proposed of annotating the DataFrame limits how many lines will be used by `.describe()`

freezing-airport-6809

09/23/2022, 2:02 PM

Ya, describing whole dataframe as html does not seem like a good idea

high-accountant-32689

09/23/2022, 5:04 PM

@sparse-window-1536, no, what Yee proposed (using the `TopFrameRenderer`) does not run `describe`; instead it turns a fixed number of rows directly into html: https://github.com/flyteorg/flytekit/blob/3cf063955907957de65b035066fe415503a9bd65/flytekit/deck/renderer.py#L17-L27
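The difference between summarizing a whole table and rendering only its first rows can be illustrated with a small standard-library sketch. This is not flytekit's code: `render_top_rows` is a hypothetical function that works on a list of dicts rather than a pandas DataFrame, and it only shows why the cost depends on the row limit instead of the table size.

```python
import html


def render_top_rows(rows, max_rows=10):
    """Render only the first max_rows rows (a list of dicts) as an HTML table.

    Hypothetical stand-in for a "top rows" renderer; no statistics are computed.
    """
    head = rows[:max_rows]  # only this slice is ever touched
    if not head:
        return "<table></table>"
    columns = list(head[0].keys())
    header = "".join(f"<th>{html.escape(str(col))}</th>" for col in columns)
    body = "".join(
        "<tr>"
        + "".join(f"<td>{html.escape(str(row[col]))}</td>" for col in columns)
        + "</tr>"
        for row in head
    )
    return f"<table><tr>{header}</tr>{body}</table>"


# Sample data shaped like the dataframe in the thread (values are made up).
sample_rows = [
    {"generic_sku": f"HT-{i}", "topics": ["ST1", "ST4"]} for i in range(1000)
]
table_html = render_top_rows(sample_rows, max_rows=10)
```

With pandas itself, the closest equivalent expression would be along the lines of `df.head(10).to_html()`, which likewise avoids computing summary statistics over every row.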

elegant-australia-91422

10/12/2022, 9:23 PM

We're on flytekit 1.2.0 & are noticing similar behavior; we have a 60k row DF w/ ~1.5k columns that we read from our data warehouse & return as a `DataFrame` in a task. We were previously able to run this in ~2 mins on flytekit 1.1.x, but since upgrading this stage is stalling for over 2-3 hrs. It also takes ~90 seconds to read this dataframe in a jupyter notebook. We're noticing an interesting memory usage pattern here as well w/ memory inching upwards as the task executes. The CPU (currently 1) is maxed out towards the start of execution. Any thoughts on what might have caused this? We're also about to try rolling back flytekit to see if that resolves things.

high-accountant-32689

10/12/2022, 9:57 PM

@elegant-australia-91422, interesting. In this particular case we narrowed this down to a pandas behavior (more specifically, the call to `describe` takes a long time to run). Can you say more about what you're seeing (in a separate thread)?
