Flyte enables production-grade orchestration for machine learning workflows and data processing created to accelerate local workflows to production.

I have an offline model pipeline that runs entirely in Spark: preprocessing, training LightGBM (SynapseML), and predictions. I am curious if this could fit into the UnionML framework. This would allow all the jobs to scale really nicely.

I can see moving the preprocessing into the Dataset and then swapping out the DataFrame for a Spark DataFrame. There would need to be some config to have the tasks create Spark jobs.

Anyways let me know if this doesn’t sound crazy haha

This is not crazy at all!

Need to improve the docs around this, but you can pass in all the `@task` kwargs to the following decorators:
• <https://unionml.readthedocs.io/en/latest/dataset.html#reader|Dataset.reader>
• <https://unionml.readthedocs.io/en/latest/model.html#trainer|Model.trainer>
• <https://unionml.readthedocs.io/en/latest/model.html#predictor|Model.predictor>
Meaning you can specify `SparkConfig` or any other Flyte-compatible task type configuration that typically works with Flyte tasks.
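To make the kwarg forwarding concrete, here is a minimal, library-free sketch of the pattern. Everything in it is a hypothetical stand-in: `task`, the toy `Dataset` class, and the plain dict used as a Spark configuration are illustrations only, not the UnionML or Flyte implementation. The idea it shows is that keyword arguments given to the `reader` decorator are handed through unchanged to the underlying task decorator.

```python
from typing import Any, Callable, Dict, Optional


# Hypothetical stand-in for a Flyte-style task decorator: it only records
# the configuration it receives so the forwarding is visible.
def task(fn: Callable, **task_kwargs: Any) -> Callable:
    fn.task_kwargs = task_kwargs
    return fn


class Dataset:
    """Toy dataset class (assumption), not the UnionML implementation."""

    def __init__(self) -> None:
        self._reader: Optional[Callable] = None

    def reader(self, fn: Optional[Callable] = None, **reader_task_kwargs: Any):
        # Used as @dataset.reader(task_config=...): return a decorator
        # that remembers the keyword arguments.
        if fn is None:
            return lambda f: self.reader(f, **reader_task_kwargs)
        # Used as plain @dataset.reader, or called by the lambda above:
        # forward every keyword argument to the task decorator.
        self._reader = task(fn, **reader_task_kwargs)
        return self._reader


dataset = Dataset()

# A plain dict stands in for a Spark task configuration object.
spark_config: Dict[str, Any] = {"spark_conf": {"spark.executor.instances": "2"}}


@dataset.reader(task_config=spark_config, retries=2)
def read_rows() -> list:
    return [1, 2, 3]
```

In this sketch `read_rows.task_kwargs` holds exactly what was passed to the decorator, which is the behavior described above for the reader, trainer, and predictor decorators.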

:exploding_head: :exploding_head: :exploding_head:

Alright I am going to use it next week :slightly_smiling_face:

<@U01DYLVUNJE> say I want to run a model a few different times with different feature sets and possibly filtering rows (like subsetting to people in the USA).

• For running variations, I believe I can just wrap these in flyte tasks
• I see how to change the features, but not how to subset rows through the Dataset class.
Do you have any recommendations for the second one?

You can vary the features.

*<https://unionml.readthedocs.io/en/latest/dataset.html#reader|Dataset.reader>*
The dataset reader should output your dataset, and you can parameterize this function however you wish. Here’s a basic example:

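```
import pandas as pd

# `dataset` is the unionml Dataset instance defined elsewhere
@dataset.reader
def reader(feature_set: str, row_filter: str) -> pd.DataFrame:
    data = ...  # get data using a SQL query, or whatever

    # pick the columns for the requested feature set
    if feature_set == "feature_set_1":
        selected_data = ...
    elif feature_set == "feature_set_2":
        selected_data = ...
    else:
        ...  # etc

    # subset the rows of the selected data
    if row_filter == "something":
        filtered_data = ...
    else:
        filtered_data = selected_data  # no row filter applied

    return filtered_data
```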

You can then pass these parameters to <https://unionml.readthedocs.io/en/latest/generated_api_reference/unionml.model.Model.html#unionml.model.Model.train|model.train>, where the `**reader_kwargs` are forwarded to the reader function.
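As a rough, library-free illustration of that flow, the sketch below parameterizes a reader by feature set and row filter, then runs several variations through a toy `train` function that forwards its keyword arguments to the reader. The sample rows, the feature set names, the `usa_only` filter, and `train` itself are all hypothetical stand-ins, and plain lists of dicts are used in place of a DataFrame.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]

# Hypothetical in-memory rows standing in for a SQL query result.
RAW_ROWS: List[Row] = [
    {"age": 34, "income": 72000, "country": "USA", "label": 1},
    {"age": 51, "income": 58000, "country": "CAN", "label": 0},
    {"age": 27, "income": 41000, "country": "USA", "label": 0},
]

# Hypothetical feature sets; the label column is always kept.
FEATURE_SETS: Dict[str, List[str]] = {
    "demographics": ["age"],
    "financial": ["age", "income"],
}


def reader(feature_set: str, row_filter: str = "all") -> List[Row]:
    if feature_set not in FEATURE_SETS:
        raise ValueError(f"unknown feature_set: {feature_set}")

    # Filter rows first, while the country column is still available.
    if row_filter == "all":
        rows = RAW_ROWS
    elif row_filter == "usa_only":
        rows = [row for row in RAW_ROWS if row["country"] == "USA"]
    else:
        raise ValueError(f"unknown row_filter: {row_filter}")

    # Then keep only the columns of the requested feature set.
    columns = FEATURE_SETS[feature_set] + ["label"]
    return [{col: row[col] for col in columns} for row in rows]


def train(reader_fn: Callable[..., List[Row]], **reader_kwargs: Any) -> int:
    """Toy stand-in for a train entry point: forwards kwargs to the reader."""
    rows = reader_fn(**reader_kwargs)
    return len(rows)  # a real trainer would fit a model on these rows


# One run per (feature set, row filter) combination.
runs = {
    (fs, rf): train(reader, feature_set=fs, row_filter=rf)
    for fs in FEATURE_SETS
    for rf in ("all", "usa_only")
}
```

Filtering before selecting columns matters here because the filter column (`country`) is not part of either feature set, so it would be gone if the order were reversed.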

Actually, would you mind creating an issue to document this use case? <https://github.com/unionai-oss/unionml/issues/new>

Would love to capture it and add a page about this in the docs

Come to think of it, the <https://unionml.readthedocs.io/en/latest/dataset.html#loader|Dataset.loader> decorator might be a better place to vary datasets based on per-run parameters, but currently this isn’t supported.

Working on <https://github.com/unionai-oss/unionml/issues/98|this issue> to address this tho

I added an issue and yes I agree to the dataset loader being a great place.