Hey, everyone. We're dealing with a weird bug here...
# ask-the-community
m
Hey, everyone. We're dealing with a weird bug here and we have no idea how to fix it. Basically, we have a task that's not finishing. We run a workflow with 8 tasks, and at the very last one it hangs. All of its code is being executed (a lot of `print`s from the very start all the way to the the
return
statement confirm that), but it hangs at the end and never actually finishes. We were able to reproduce it in our remote server and locally. On the remote server, none of the prints (or logs) are being shown on Stackdriver. What could be happening?
We're using
flytekit~=1.0.0
, so I believe we're using the latest version if compatible release is to be trusted
prints logged when using
pyflyte run
. This
7 None
happens right before the return statement (
None
is the value of that statement)
k
so IIUC, you are returning from your code and you are saying that the return completes, but hangs after this?
cc @Eduardo Apolinario (eapolinario) / @Yee Could this be upload of the literal?
m
Yeah, I'm not sure the return completes since I can't execute anything after it 😅 but the code right after it does.
k
and you said you can reproduce this locally?
m
Weirdly enough, this task is the only one that returns
None
, and that does not send its return value to another variable in the workflow spec. The workflow also returns
None
.
e
is there anything unusual about the task? Can you share its overall structure?
m
yeah, this screenshot I sent is from
pyflyte run
y
sounds like a bug? can you change it to
return 5
and make the signature an int?
what’s the return type now?
m
This is the overall structure of the task:
Copy code
@extended_task(integrations=['gcloud'], requests=Resources(mem='4Gi'))
def update_bq_table(
    amnt_dataframe: pd.DataFrame,
    gcs_config_path: str
) -> None:
    config_dict = read_file(gcs_config_path)
    update_gbq_table(    # Function that calls pandas_gbq.to_gbq()
        amnt_dataframe,
        config_dict['table_schema'],
        config_dict['table_destination']
    )
@extended_task
is a special decorator we use that does some pre and post-processing on tasks. Other tasks with it are running fine; the
7 None
on my screenshot above is being called on the wrapper, after a
output = task_func(*args, **kwargs)
and before a
return output
.
y
can you try changing it to int?
just to see if it that fixes it?
m
The workflow looks like this:
Copy code
@workflow
def main_workflow(
        hotel_amnt_sql_path: str,
        config_path: str,
        config_pre_process_path: str,
        model_config_path: str
) -> None:
    preview_amnt = ...

    # Some other tasks, all with <output> = <function call>

    update_bq_table(
        amnt_dataframe = hotel_topics,
        gcs_config_path = config_path
    )
yeah I'll try it
s
Hey! I work with @Matheus Moreno. We changed it to int, but it didn't work 😕
y
@Matheus Moreno @Sérgio de Melo Barreto Junior hop on call?
m
yeah, send us the link
can you try something for me?
in the body of the task that’s hanging
can you delete everything that’s in that function and just replace it with
Copy code
print(amnt_dataframe.describe().to_html())
and keep all the
print(7)
s
s
yeah I'll try it
it is hanging in this
print(amnt_dataframe.describe().to_html())
y
ah nice
can you remove the
.to_html()
?
and see if it still hangs?
s
still hanging
e
how big is this dataframe? Does it hang if you try with only a few rows ?
y
so describe is failing. yeah how big is this
s
237857 rows x 2 columns
y
that’s not that big…
s
5,7MB
y
are you able to run describe outside of flyte
just like in jupyter or in ipython or something, create the dataframe manually and try describe on it.
s
one of our columns is the "topics" and it has list for each row of the dataframe. Is it possible to be the problem?
y
the whole dataframe is 5.7 MBs?
i don’t think any structure that’s that small should be a problem for pandas
in any case
can you maybe continue to investigate on the side? and in the meantime, add this to the top of your file
Copy code
from flytekit.deck.renderer import TopFrameRenderer
from typing_extensions import Annotated
and then make the task like
Copy code
@task
def mytask() -> Annotated[pd.DataFrame, TopFrameRenderer(10)]: ...
that should make it so that the renderer used just grabs the first 10 rows
will make it skip the describe call
but this is something we should continue to investigate. do you think you can send us a parquet file with the smallest set of data that can repro this?
s
@Yee Do you have any update? I am still fighting against this hanging problem 😕
y
let me play around with this tonight.
but did the workaround not work?
e
just to confirm, I can see the python process get stuck when running this:
Copy code
❯ ipython
Python 3.8.13 (default, Mar 28 2022, 11:38:47)
Type 'copyright', 'credits' or 'license' for more information
IPython 8.5.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import pandas as pd

In [2]: pd.read_parquet("/home/eduardo/Downloads/amnt_dataframe.parquet.gzip")
Out[2]:
                      generic_sku                                           topics
0       HT-0008-0-0-0-0-0-0-0-0-0                  [ST5, ST6, ST2, ST13, ST4, ST3]
1       HT-000M-0-0-0-0-0-0-0-0-0             [ST5, ST6, ST2, ST7, ST13, ST4, ST3]
2       HT-000W-0-0-0-0-0-0-0-0-0            [ST5, ST6, ST12, ST10, ST7, ST4, ST1]
3       HT-000X-0-0-0-0-0-0-0-0-0                       [ST5, ST6, ST13, ST4, ST1]
4       HT-000Z-0-0-0-0-0-0-0-0-0                 [ST5, ST10, ST13, ST4, ST1, ST3]
...                           ...                                              ...
237852  HT-ZZY9-0-0-0-0-0-0-0-0-0                                       [ST1, ST4]
237853  HT-ZZYC-0-0-0-0-0-0-0-0-0  [ST5, ST6, ST2, ST12, ST7, ST13, ST4, ST1, ST3]
237854  HT-ZZYZ-0-0-0-0-0-0-0-0-0                      [ST5, ST10, ST13, ST4, ST1]
237855  HT-ZZZ2-0-0-0-0-0-0-0-0-0                  [ST5, ST12, ST7, ST4, ST1, ST3]
237856  HT-ZZZJ-0-0-0-0-0-0-0-0-0                             [ST4, ST5, ST6, ST2]

[237857 rows x 2 columns]

In [3]: df = pd.read_parquet("/home/eduardo/Downloads/amnt_dataframe.parquet.gzip")

In [4]: df.describe()
m
Yeah, me too. It takes a few seconds for my PC to describe 1k lines, that's why it's taking hours to describe the entire dataset.
The solution that @Yee proposed of annotating the DataFrame limits how many lines will be used by
.describe()
?
k
Ya, describing whole dataframe as html does not seem like a good idea
e
@Matheus Moreno, no, what Yee proposed (using the
TopFrameRenderer
) does not run
describe
, instead it turns a fixed number of rows directly into html: https://github.com/flyteorg/flytekit/blob/3cf063955907957de65b035066fe415503a9bd65/flytekit/deck/renderer.py#L17-L27
r
We're on flytekit 1.2.0 & are noticing similar behavior; we have a a 60k row DF w/ ~1.5k columns that we read from our data warehouse & return as a
DataFrame
in a task. We were previously able to run this in ~2 mins on flytekit 1.1.x, but since upgrading this stage is stalling for over 2-3 hrs. It also takes ~90 seconds to read this dataframe in a jupyter notebook We're noticing an interesting memory usage pattern here as well w/ memory inching upwards as the task executes. The CPU (currently 1) is maxed out towards the start of execution Any thoughts on what might have caused this? We're also about to try rolling back flytekit to see if that resolves things
e
@Rahul Mehta, interesting. In this particular case we narrowed this down to a pandas behavior (more specifically, the call to
describe
takes a long time to run). Can you say more about what you're seeing (in a separate thread)?
110 Views