https://flyte.org logo
m

Matheus Moreno

09/19/2022, 8:33 PM
Hey, everyone. We're dealing with a weird bug here and we have no idea how to fix it. Basically, we have a task that's not finishing. We run a workflow with 8 tasks, and at the very last one it hangs. All of its code is being executed (a lot of `print`s from the very start all the way to the the
return
statement confirm that), but it hangs at the end and never actually finishes. We were able to reproduce it in our remote server and locally. On the remote server, none of the prints (or logs) are being shown on Stackdriver. What could be happening?
We're using
flytekit~=1.0.0
, so I believe we're using the latest version if compatible release is to be trusted
prints logged when using
pyflyte run
. This
7 None
happens right before the return statement (
None
is the value of that statement)
k

Ketan (kumare3)

09/19/2022, 9:06 PM
so IIUC, you are returning from your code and you are saying that the return completes, but hangs after this?
cc @Eduardo Apolinario (eapolinario) / @Yee Could this be upload of the literal?
m

Matheus Moreno

09/19/2022, 9:08 PM
Yeah, I'm not sure the return completes since I can't execute anything after it 😅 but the code right after it does.
k

Ketan (kumare3)

09/19/2022, 9:09 PM
and you said you can reproduce this locally?
m

Matheus Moreno

09/19/2022, 9:09 PM
Weirdly enough, this task is the only one that returns
None
, and that does not send its return value to another variable in the workflow spec. The workflow also returns
None
.
e

Eduardo Apolinario (eapolinario)

09/19/2022, 9:09 PM
is there anything unusual about the task? Can you share its overall structure?
m

Matheus Moreno

09/19/2022, 9:09 PM
yeah, this screenshot I sent is from
pyflyte run
y

Yee

09/19/2022, 9:10 PM
sounds like a bug? can you change it to
return 5
and make the signature an int?
what’s the return type now?
m

Matheus Moreno

09/19/2022, 9:16 PM
This is the overall structure of the task:
Copy code
@extended_task(integrations=['gcloud'], requests=Resources(mem='4Gi'))
def update_bq_table(
    amnt_dataframe: pd.DataFrame,
    gcs_config_path: str
) -> None:
    config_dict = read_file(gcs_config_path)
    update_gbq_table(    # Function that calls pandas_gbq.to_gbq()
        amnt_dataframe,
        config_dict['table_schema'],
        config_dict['table_destination']
    )
@extended_task
is a special decorator we use that does some pre and post-processing on tasks. Other tasks with it are running fine; the
7 None
on my screenshot above is being called on the wrapper, after a
output = task_func(*args, **kwargs)
and before a
return output
.
y

Yee

09/19/2022, 9:17 PM
can you try changing it to int?
just to see if it that fixes it?
m

Matheus Moreno

09/19/2022, 9:18 PM
The workflow looks like this:
Copy code
@workflow
def main_workflow(
        hotel_amnt_sql_path: str,
        config_path: str,
        config_pre_process_path: str,
        model_config_path: str
) -> None:
    preview_amnt = ...

    # Some other tasks, all with <output> = <function call>

    update_bq_table(
        amnt_dataframe = hotel_topics,
        gcs_config_path = config_path
    )
yeah I'll try it
s

Sérgio de Melo Barreto Junior

09/20/2022, 4:05 PM
Hey! I work with @Matheus Moreno. We changed it to int, but it didn't work 😕
y

Yee

09/20/2022, 5:03 PM
@Matheus Moreno @Sérgio de Melo Barreto Junior hop on call?
m

Matheus Moreno

09/20/2022, 5:09 PM
yeah, send us the link
can you try something for me?
in the body of the task that’s hanging
can you delete everything that’s in that function and just replace it with
Copy code
print(amnt_dataframe.describe().to_html())
and keep all the
print(7)
s
s

Sérgio de Melo Barreto Junior

09/20/2022, 8:03 PM
yeah I'll try it
it is hanging in this
print(amnt_dataframe.describe().to_html())
y

Yee

09/20/2022, 8:20 PM
ah nice
can you remove the
.to_html()
?
and see if it still hangs?
s

Sérgio de Melo Barreto Junior

09/20/2022, 8:42 PM
still hanging
e

Eduardo Apolinario (eapolinario)

09/20/2022, 8:45 PM
how big is this dataframe? Does it hang if you try with only a few rows ?
y

Yee

09/20/2022, 8:45 PM
so describe is failing. yeah how big is this
s

Sérgio de Melo Barreto Junior

09/20/2022, 8:46 PM
237857 rows x 2 columns
y

Yee

09/20/2022, 8:48 PM
that’s not that big…
s

Sérgio de Melo Barreto Junior

09/20/2022, 8:48 PM
5,7MB
y

Yee

09/20/2022, 8:48 PM
are you able to run describe outside of flyte
just like in jupyter or in ipython or something, create the dataframe manually and try describe on it.
s

Sérgio de Melo Barreto Junior

09/20/2022, 8:53 PM
one of our columns is the "topics" and it has list for each row of the dataframe. Is it possible to be the problem?
y

Yee

09/20/2022, 8:54 PM
the whole dataframe is 5.7 MBs?
i don’t think any structure that’s that small should be a problem for pandas
in any case
can you maybe continue to investigate on the side? and in the meantime, add this to the top of your file
Copy code
from flytekit.deck.renderer import TopFrameRenderer
from typing_extensions import Annotated
and then make the task like
Copy code
@task
def mytask() -> Annotated[pd.DataFrame, TopFrameRenderer(10)]: ...
that should make it so that the renderer used just grabs the first 10 rows
will make it skip the describe call
but this is something we should continue to investigate. do you think you can send us a parquet file with the smallest set of data that can repro this?
s

Sérgio de Melo Barreto Junior

09/22/2022, 7:32 PM
@Yee Do you have any update? I am still fighting against this hanging problem 😕
y

Yee

09/22/2022, 8:07 PM
let me play around with this tonight.
but did the workaround not work?
e

Eduardo Apolinario (eapolinario)

09/23/2022, 3:02 AM
just to confirm, I can see the python process get stuck when running this:
Copy code
❯ ipython
Python 3.8.13 (default, Mar 28 2022, 11:38:47)
Type 'copyright', 'credits' or 'license' for more information
IPython 8.5.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import pandas as pd

In [2]: pd.read_parquet("/home/eduardo/Downloads/amnt_dataframe.parquet.gzip")
Out[2]:
                      generic_sku                                           topics
0       HT-0008-0-0-0-0-0-0-0-0-0                  [ST5, ST6, ST2, ST13, ST4, ST3]
1       HT-000M-0-0-0-0-0-0-0-0-0             [ST5, ST6, ST2, ST7, ST13, ST4, ST3]
2       HT-000W-0-0-0-0-0-0-0-0-0            [ST5, ST6, ST12, ST10, ST7, ST4, ST1]
3       HT-000X-0-0-0-0-0-0-0-0-0                       [ST5, ST6, ST13, ST4, ST1]
4       HT-000Z-0-0-0-0-0-0-0-0-0                 [ST5, ST10, ST13, ST4, ST1, ST3]
...                           ...                                              ...
237852  HT-ZZY9-0-0-0-0-0-0-0-0-0                                       [ST1, ST4]
237853  HT-ZZYC-0-0-0-0-0-0-0-0-0  [ST5, ST6, ST2, ST12, ST7, ST13, ST4, ST1, ST3]
237854  HT-ZZYZ-0-0-0-0-0-0-0-0-0                      [ST5, ST10, ST13, ST4, ST1]
237855  HT-ZZZ2-0-0-0-0-0-0-0-0-0                  [ST5, ST12, ST7, ST4, ST1, ST3]
237856  HT-ZZZJ-0-0-0-0-0-0-0-0-0                             [ST4, ST5, ST6, ST2]

[237857 rows x 2 columns]

In [3]: df = pd.read_parquet("/home/eduardo/Downloads/amnt_dataframe.parquet.gzip")

In [4]: df.describe()
m

Matheus Moreno

09/23/2022, 2:00 PM
Yeah, me too. It takes a few seconds for my PC to describe 1k lines, that's why it's taking hours to describe the entire dataset.
The solution that @Yee proposed of annotating the DataFrame limits how many lines will be used by
.describe()
?
k

Ketan (kumare3)

09/23/2022, 2:02 PM
Ya, describing whole dataframe as html does not seem like a good idea
e

Eduardo Apolinario (eapolinario)

09/23/2022, 5:04 PM
@Matheus Moreno, no, what Yee proposed (using the
TopFrameRenderer
) does not run
describe
, instead it turns a fixed number of rows directly into html: https://github.com/flyteorg/flytekit/blob/3cf063955907957de65b035066fe415503a9bd65/flytekit/deck/renderer.py#L17-L27
r

Rahul Mehta

10/12/2022, 9:23 PM
We're on flytekit 1.2.0 & are noticing similar behavior; we have a a 60k row DF w/ ~1.5k columns that we read from our data warehouse & return as a
DataFrame
in a task. We were previously able to run this in ~2 mins on flytekit 1.1.x, but since upgrading this stage is stalling for over 2-3 hrs. It also takes ~90 seconds to read this dataframe in a jupyter notebook We're noticing an interesting memory usage pattern here as well w/ memory inching upwards as the task executes. The CPU (currently 1) is maxed out towards the start of execution Any thoughts on what might have caused this? We're also about to try rolling back flytekit to see if that resolves things
e

Eduardo Apolinario (eapolinario)

10/12/2022, 9:57 PM
@Rahul Mehta, interesting. In this particular case we narrowed this down to a pandas behavior (more specifically, the call to
describe
takes a long time to run). Can you say more about what you're seeing (in a separate thread)?
4 Views