I have an offline model pipeline that works 100% in spark: preprocessing, training LightGBM (synapse ml), and predictions. I am curious if this could fit into the UnionML framework. This would allow all the jobs to scale really nicely. I can see moving the preprocessing into the Dataset and then swapping out the DataFrame for a spark dataframe. There would need to be some config to have the tasks create spark jobs. Anyways let me know if this doesn’t sound crazy haha
This is not crazy at all! Need to improve the docs around this, but you can pass in all the Flyte task kwargs to the following decorators:
• Dataset.reader
• Model.trainer
• Model.predictor
Meaning you can specify a Spark task configuration, or any other Flyte-compatible task type configuration that typically works with Flyte tasks.
Alright I am going to use it next week 🙂
@Niels Bantilan say I want to run a model a few different times with different feature sets and possibly filtered rows (like subsetting to people in the USA).
• For running variations, I believe I can just wrap these in Flyte tasks.
• I see how to change the features, but not how to subset rows through the Dataset class.
Do you have any recommendations for the second one?
You can vary the features in Dataset.reader. The dataset reader should output your dataset, and you can parameterize this function however you wish. Here’s a basic example:
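import pandas as pd

def reader(feature_set: str, row_filter: str) -> pd.DataFrame:
    data = ...  # get data using a SQL query, or whatever
    # pick the columns for the requested feature set
    if feature_set == "feature_set_1":
        selected_data = ...
    elif feature_set == "feature_set_2":
        selected_data = ...
        ...  # etc
    # then subset the rows of the selected data
    if row_filter == "something":
        filtered_data = ...
    else:
        filtered_data = ...  # other filters, or no filtering at all

    return filtered_data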
You can then pass these parameters to model.train, where the reader keyword arguments are forwarded to the reader function.
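To make the forwarding pattern concrete, here’s a rough, dependency-free sketch. Everything in it is made up for illustration: the toy rows, the feature-set names, the country filter, and the train function, which is only a stand-in that forwards keyword arguments to the reader and is not UnionML’s model.train. Rows are plain dicts so it runs without pandas or Spark.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]

# Hypothetical toy data; in practice this would come from a SQL query or a Spark job.
RAW_ROWS: List[Row] = [
    {"age": 34, "income": 72000, "clicks": 12, "country": "USA", "label": 1},
    {"age": 51, "income": 58000, "clicks": 3, "country": "CAN", "label": 0},
    {"age": 27, "income": 41000, "clicks": 8, "country": "USA", "label": 0},
]

# Hypothetical feature-set names mapped to the columns they keep.
FEATURE_SETS: Dict[str, List[str]] = {
    "demographics": ["age", "income"],
    "behavior": ["clicks"],
}


def reader(feature_set: str, country: str = "") -> List[Row]:
    """Return the rows for one variation: filter rows first, then select columns."""
    columns = FEATURE_SETS[feature_set] + ["label"]
    # Filter before selecting so the filter column does not have to be a feature.
    rows = [r for r in RAW_ROWS if not country or r["country"] == country]
    return [{c: r[c] for c in columns} for r in rows]


def train(**reader_kwargs: Any) -> Dict[str, Any]:
    """Stand-in for a training entry point: forwards its kwargs to reader."""
    data = reader(**reader_kwargs)
    # A real trainer would fit a model here; this only summarizes the dataset.
    return {"n_rows": len(data), "columns": sorted(data[0]) if data else []}


# One run per variation of feature set and row filter.
variations = [
    {"feature_set": "demographics", "country": "USA"},
    {"feature_set": "behavior"},
]
results = [train(**kwargs) for kwargs in variations]
```

Each entry in variations is one run, so adding a new feature set or row filter is just another dict in the list. With a real reader the same shape should apply: the reader takes the parameters, and the training call passes them through.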
So helpful!
Actually, would you mind creating an issue to document this use case? https://github.com/unionai-oss/unionml/issues/new Would love to capture it and add a page about this in the docs
Yeah of course!
Come to think of it, the Dataset.loader decorator might be a better place to vary datasets based on per-run parameters, but currently this isn’t supported. Working on this issue to address this tho
I added an issue, and yes, I agree that the dataset loader would be a great place for this.