I am using Ray with Modin to process large dataset...
# ask-the-community
f
I am using Ray with Modin to process a large dataset in my workflow. Therefore I use modin.pandas.DataFrame and modin.pandas.Series instead of the pandas versions of DataFrame and Series in my task's input params and return values. However, the data serialization error messages below suggest that modin.pandas.DataFrame and modin.pandas.Series are not supported by Flyte yet. Am I correct? I think Ray with Modin is a high-impact feature, since the Flyte team wants to support Ray. What is the process for submitting a change request? Thanks. CC: @Kevin Su @Eduardo Apolinario (eapolinario)
flytekit.exceptions.scopes.FlyteScopedUserException: Could not find a renderer for <class 'modin.pandas.dataframe.DataFrame'>
...
  File ".../flytekit/types/structured/structured_dataset.py", line 699, in to_html
    raise NotImplementedError(f"Could not find a renderer for {type(df)} in {self.Renderers}")
NotImplementedError: Could not find a renderer for <class 'modin.pandas.dataframe.DataFrame'> in {<class 'pandas.core.frame.DataFrame'>: <flytekit.deck.renderer.TopFrameRenderer object at 0x
e
@Frank Shen, can you confirm which version of flytekit you're running? From the stack trace it looks like this is specific to flytedecks, but those should have been disabled as of flytekit 1.2.3
That doesn't mean we shouldn't support decks for modin dataframes of course, only that this should be tracked separately.
f
Hi @Eduardo Apolinario (eapolinario), I am using flytekit 1.2.0
e
OK, can you try one of two things to unblock yourself? Either: 1. set disable_deck=True in the task definition, or 2. update to flytekit 1.2.3 and try again.
f
@Eduardo Apolinario (eapolinario), do you mean modin DataFrame, etc. has already been supported by Flyte?
k
Seems like we forgot to register a renderer in the modin plugin. I'll create a PR shortly.
e
@Frank Shen, just want to flag in case it wasn't very clear, but https://docs.flyte.org/projects/cookbook/en/latest/auto/integrations/flytekit_plugins/modin_examples/knn_classifier.html#knn-classifier is an example of using ray and modin.
f
Oh, I haven't installed flytekitplugins-modin yet. Thanks.
@Kevin Su, @Eduardo Apolinario (eapolinario), installing flytekitplugins-modin downgrades flytekit from 1.2.0 to 0.32.6. What have I done wrong?
Successfully uninstalled flytekit-1.2.0
Successfully installed checksumdir-1.2.0 flytekit-0.32.6 flytekitplugins-modin-0.31.0
e
@Frank Shen, how did you install it?
f
pip install flytekitplugins-modin
e
@Frank Shen, can you force a version? Something like
pip install flytekitplugins-modin==1.2.4 flytekit==1.2.4
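Editor's note: before and after pinning, it can help to check which versions actually ended up in the environment. The following is a minimal sketch using only the standard library's importlib.metadata (Python 3.8+). The helper names installed_versions and all_same_version are hypothetical, and the idea that flytekit and its plugins should share one version number is an assumption drawn from the pinning advice in this thread.

```python
from importlib import metadata


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def all_same_version(versions):
    """True when every installed package in the mapping reports the same version."""
    found = {v for v in versions.values() if v is not None}
    return len(found) <= 1


# Package names taken from the thread; adjust to the plugins you use.
packages = ["flytekit", "flytekitplugins-modin", "flytekitplugins-snowflake"]
versions = installed_versions(packages)
for name, version in versions.items():
    print(f"{name}: {version or 'not installed'}")
print("versions agree:", all_same_version(versions))
```

A result such as flytekit 0.32.6 next to flytekitplugins-modin 0.31.0 would show up here as a mismatch.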
f
@Eduardo Apolinario (eapolinario), that won’t work, because we are also using flytekitplugins-snowflake, and flytekitplugins-snowflake requires flytekit<1.2.0 and >=1.1.0b0
The conflict is caused by:
    The user requested flytekit>=1.2.3
    flytekitplugins-snowflake 1.1.1 depends on flytekit<1.2.0 and >=1.1.0b0
    The user requested flytekit>=1.2.3
    flytekitplugins-snowflake 1.1.0 depends on flytekit<1.2.0 and >=1.1.0b0
    The user requested flytekit>=1.2.3
    flytekitplugins-snowflake 1.0.5 depends on flytekit<1.2.0 and >=1.0.0b3
    The user requested flytekit>=1.2.3
    flytekitplugins-snowflake 1.0.4 depends on flytekit<1.2.0 and >=1.1.0b0
    The user requested flytekit>=1.2.3
e
can you also force the snowflake plugin to the same version?
f
like
flytekitplugins-snowflake==1.2.4
?
e
yeah, something like
pip install flytekitplugins-modin==1.2.4 flytekit==1.2.4 flytekitplugins-snowflake==1.2.4
f
I think the highest flytekitplugins-snowflake version is 1.1.1, am I wrong? How could I confirm if flytekitplugins-snowflake 1.2.4 exists?
e
f
I see. I will try right now. Thank you @Eduardo Apolinario (eapolinario)!
@Eduardo Apolinario (eapolinario), I have the conflict as shown below. However, it doesn’t make sense to me. Could you tell me where the conflict is? Thanks.
The conflict is caused by:
    flytekit 1.2.4 depends on pandas<2.0.0 and >=1.0.0
    modin 0.17.0 depends on pandas==1.5.1
    flytekit 1.2.4 depends on pandas<2.0.0 and >=1.0.0
    modin 0.16.2 depends on pandas==1.5.1
    flytekit 1.2.4 depends on pandas<2.0.0 and >=1.0.0
    modin 0.16.1 depends on pandas==1.5.0
    flytekit 1.2.4 depends on pandas<2.0.0 and >=1.0.0
    modin 0.16.0 depends on pandas==1.5.0
1.0.0 < 1.5.1 < 2.0.0, so I don't see any conflict.
So I don't know how to fix it.
My requirements.txt looks like this:
flytekit==1.2.4
flytekitplugins-snowflake==1.2.4
flytekitplugins-spark==1.2.4
flytekitplugins-modin==1.2.4
xgboost
ray
modin
xgboost_ray
scikit-learn
e
interesting. I just tried in a brand new venv and it worked. Can you paste the full stacktrace of the error you're seeing, @Frank Shen?
f
It worked one time for me, then it kept failing multiple times with various conflicting reasons.
e
can you say more? What do you mean by "it kept failing multiple times with various conflicting reasons"?
f
@Eduardo Apolinario (eapolinario)
e
@Frank Shen, I see this line in the logs:
Collecting pandas<2.0.0,>=1.0.0
  Using cached <https://maven.homebox.com/repository/max-pypi-releases/packages/pandas/1.3.5/pandas-1.3.5-cp37-cp37m-macosx_10_9_x86_64.whl> (11.0 MB)
can you add
pandas==1.5.1
in your requirements file?
f
new error:
ERROR: Ignored the following versions that require a different python version: 1.4.0 Requires-Python >=3.8; 1.4.0rc0 Requires-Python >=3.8; 1.4.1 Requires-Python >=3.8; 1.4.2 Requires-Python >=3.8; 1.4.3 Requires-Python >=3.8; 1.4.4 Requires-Python >=3.8; 1.5.0 Requires-Python >=3.8; 1.5.0rc0 Requires-Python >=3.8; 1.5.1 Requires-Python >=3.8
ERROR: Could not find a version that satisfies the requirement pandas==1.5.1 (from versions: 0.1, 0.2, 0.3.0, 0.4.0, 0.4.1, 0.4.2, 0.4.3, 0.5.0, 0.6.0, 0.6.1, 0.7.0, 0.7.1, 0.7.2, 0.7.3, 0.8.0, 0.8.1, 0.9.0, 0.9.1, 0.10.0, 0.10.1, 0.11.0, 0.12.0, 0.13.0, 0.13.1, 0.14.0, 0.14.1, 0.15.0, 0.15.1, 0.15.2, 0.16.0, 0.16.1, 0.16.2, 0.17.0, 0.17.1, 0.18.0, 0.18.1, 0.19.0, 0.19.1, 0.19.2, 0.20.0, 0.20.1, 0.20.2, 0.20.3, 0.21.0, 0.21.1, 0.22.0, 0.23.0, 0.23.1, 0.23.2, 0.23.3, 0.23.4, 0.24.0, 0.24.1, 0.24.2, 0.25.0, 0.25.1, 0.25.2, 0.25.3, 1.0.0, 1.0.1, 1.0.2, 1.0.3, 1.0.4, 1.0.5, 1.1.0, 1.1.1, 1.1.2, 1.1.3, 1.1.4, 1.1.5, 1.2.0, 1.2.1, 1.2.2, 1.2.3, 1.2.4, 1.2.5, 1.3.0, 1.3.1, 1.3.2, 1.3.3, 1.3.4, 1.3.5)
ERROR: No matching distribution found for pandas==1.5.1
e
ok, so the package index you're installing from (https://maven.homebox.com/repository/max-pypi-releases/) doesn't have the latest version of pandas
f
it does. It’s the list above that doesn’t include it.
1.3.0, 1.3.1, 1.3.2, 1.3.3, 1.3.4, 1.3.5)
e
oooh, pandas dropped support for python 3.7 in 1.5.0: https://github.com/pandas-dev/pandas/releases/tag/v1.5.0
is there any way you can use a python version >=3.8 ?
@Frank Shen ^
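Editor's note: the reason the ranges look compatible on paper yet fail to resolve is that pip first discards releases whose Requires-Python excludes the running interpreter, and only then applies the version constraints. The sketch below models that order of operations in plain Python. It is an illustration, not pip's resolver: the function names are hypothetical, the (3, 8) entries come from the "Requires-Python >=3.8" lines in the log above, and the (3, 7) entries are an assumption made only because pip still listed those releases as candidates.

```python
def parse(version):
    """Turn '1.5.1' into (1, 5, 1) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


# pandas release -> minimum Python version (see the assumptions in the note above).
PANDAS_MIN_PYTHON = {
    "1.3.4": (3, 7),
    "1.3.5": (3, 7),
    "1.4.0": (3, 8),
    "1.4.4": (3, 8),
    "1.5.0": (3, 8),
    "1.5.1": (3, 8),
}


def candidates(python_version):
    """Releases left after filtering on the interpreter version."""
    return sorted(
        (v for v, min_py in PANDAS_MIN_PYTHON.items() if python_version >= min_py),
        key=parse,
    )


def satisfies_flytekit(version):
    # flytekit 1.2.4 depends on pandas<2.0.0 and >=1.0.0
    return (1, 0, 0) <= parse(version) < (2, 0, 0)


def satisfies_modin(version):
    # modin 0.17.0 depends on pandas==1.5.1
    return parse(version) == (1, 5, 1)


def resolve(python_version):
    """Newest release meeting both constraints, or None if none is left."""
    ok = [
        v for v in candidates(python_version)
        if satisfies_flytekit(v) and satisfies_modin(v)
    ]
    return ok[-1] if ok else None


print(resolve((3, 7)))  # None
print(resolve((3, 8)))  # 1.5.1
```

On Python 3.7 the only candidates are 1.3.x, which satisfy flytekit's range but not modin's exact pin, so the conflict comes from the interpreter version rather than from the two ranges themselves.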
f
Got it. I am working on it. Thank you @Eduardo Apolinario (eapolinario)
Hi @Eduardo Apolinario (eapolinario), it worked! Thank you so much!
e
amazing! Let us know how it goes.
f
Hi @Eduardo Apolinario (eapolinario), I am having trouble passing a bool input to the workflow on the command line.
@workflow
def test_wf_train(use_ray: bool = False):
...
When I run this on the command line, it doesn't work.
pyflyte run tests/test_xgboost.py test_wf_train --use_ray True
Do you know how to work with bool input at workflow level?
e
oh, just drop the value True
f
You mean
pyflyte run tests/test_xgboost.py test_wf_train --use_ray
?
Or I cannot use bool as input?
k
Yeah. If you pass the flag --use_ray, the value of use_ray is True.
pyflyte run tests/test_xgboost.py test_wf_train --use_ray
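Editor's note: the behaviour described here is the usual "presence means True" convention for boolean command-line flags. The sketch below reproduces it with the standard library's argparse purely as an analogy; it is not pyflyte's own code, and the parser and program name are hypothetical.

```python
import argparse


def build_parser():
    """A stand-in CLI with one boolean flag, mirroring use_ray: bool = False."""
    parser = argparse.ArgumentParser(prog="demo")
    # store_true: the flag takes no value; passing it sets use_ray to True.
    parser.add_argument("--use_ray", action="store_true", default=False)
    return parser


parser = build_parser()
print(parser.parse_args([]).use_ray)             # False
print(parser.parse_args(["--use_ray"]).use_ray)  # True
```

With a flag defined this way, passing "--use_ray True" is rejected, because the trailing True is read as an extra, unrecognized argument rather than as the flag's value.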
f
Thanks @Kevin Su
Hi @Kevin Su @Eduardo Apolinario (eapolinario), I installed flytekitplugins.modin as advised. I am using an input param use_ray: bool to control whether to use Ray with a modin DataFrame or a pandas.DataFrame. When use_ray is True, the task returns a modin DataFrame; when it is False, the task returns a pandas.DataFrame. Therefore I am declaring the task's return type as -> Union[pd.DataFrame, modin_pd.DataFrame].
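from typing import Union

import modin.pandas as modin_pd
import pandas as pd
import ray
from flytekit import task

import flytekitplugins.modin  # imported for its side effect of registering modin support

@task
def preprocess(df: pd.DataFrame,
               use_ray: bool
               ) -> Union[pd.DataFrame, modin_pd.DataFrame]:
    if use_ray:
        ray.init()
        df = modin_pd.DataFrame(df)
....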
However, I still got error:
{"asctime": "2022-11-23 13:10:46,651", "name": "flytekit", "levelname": "ERROR", "message": "Failed to convert return value for var o0 with error <class 'TypeError'>: Ambiguous choice of variant for union type"}
Traceback (most recent call last):
  File "/Users/fshen/.pyenv/versions/3.8.7/lib/python3.8/site-packages/flytekit/core/base_task.py", line 522, in dispatch_execute
    literals[k] = TypeEngine.to_literal(exec_ctx, v, py_type, literal_type)
  File "/Users/fshen/.pyenv/versions/3.8.7/lib/python3.8/site-packages/flytekit/core/type_engine.py", line 752, in to_literal
    lv = transformer.to_literal(ctx, python_val, python_type, expected)
  File "/Users/fshen/.pyenv/versions/3.8.7/lib/python3.8/site-packages/flytekit/core/type_engine.py", line 1060, in to_literal
    raise TypeError("Ambiguous choice of variant for union type")
TypeError: Ambiguous choice of variant for union type
I am using flytekit==1.2.4 flytekitplugins-modin==1.2.4
Is this because returning a Union type is not supported?
e
cc: @Kevin Su
k
Seems like a modin dataframe can also be serialized to a Flyte literal by the pandas transformer, because the modin dataframe inherits from the pandas dataframe.
I’m working on it
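Editor's note: the "Ambiguous choice of variant" error arises when more than one member of a Union is able to handle the returned value. The sketch below is a simplified, stdlib-only model of that situation and not flytekit's actual type engine. The classes PandasLikeFrame and ModinLikeFrame and the function choose_variant are hypothetical, and the subclass relationship simply mirrors the explanation given in this thread.

```python
class PandasLikeFrame:
    """Stand-in for the pandas DataFrame type."""


class ModinLikeFrame(PandasLikeFrame):
    """Stand-in for the modin DataFrame type, modelled as a subclass."""


def choose_variant(value, variants):
    """Pick the single union variant that accepts value, or raise TypeError."""
    matches = [variant for variant in variants if isinstance(value, variant)]
    if len(matches) > 1:
        raise TypeError("Ambiguous choice of variant for union type")
    if not matches:
        raise TypeError("No variant of the union accepts this value")
    return matches[0]


union_variants = [PandasLikeFrame, ModinLikeFrame]

# A plain pandas-like value matches exactly one variant.
print(choose_variant(PandasLikeFrame(), union_variants).__name__)  # PandasLikeFrame

# A modin-like value matches both variants, so the choice is ambiguous.
try:
    choose_variant(ModinLikeFrame(), union_variants)
except TypeError as err:
    print(err)  # Ambiguous choice of variant for union type
```

In this model the pandas-only path works while the modin path fails, because the second variant is accepted by both handlers.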