# announcements
e
I have an offline model pipeline that works 100% in spark: preprocessing, training LightGBM (synapse ml), and predictions. I am curious if this could fit into the UnionML framework. This would allow all the jobs to scale really nicely. I can see moving the preprocessing into the Dataset and then swapping out the DataFrame for a spark dataframe. There would need to be some config to have the tasks create spark jobs. Anyways let me know if this doesn’t sound crazy haha
n
This is not crazy at all! We need to improve the docs around this, but you can pass all the @task kwargs to the following decorators: • Dataset.reader • Model.trainer • Model.predictor. This means you can specify SparkConfig or any other Flyte-compatible task type configuration that typically works with Flyte tasks.
🙏 2
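The kwargs-forwarding pattern described above can be sketched without any libraries. This is only an illustration of the idea: `toy_task` and `ToyDataset` are hypothetical stand-ins, not the flytekit or UnionML implementation, and the `task_config` dict is a placeholder for a real Spark task configuration object.

```python
from typing import Any, Callable, Optional


def toy_task(fn: Callable, **task_kwargs: Any) -> Callable:
    """Hypothetical stand-in for a task decorator: records its kwargs on the function."""
    fn.task_kwargs = task_kwargs
    return fn


class ToyDataset:
    """Hypothetical stand-in for a dataset object; not the UnionML class."""

    def __init__(self) -> None:
        self._reader: Optional[Callable] = None

    def reader(self, fn: Optional[Callable] = None, **reader_task_kwargs: Any):
        # Support both the bare form (@dataset.reader) and the
        # parameterized form (@dataset.reader(**kwargs)).
        if fn is None:
            return lambda f: self.reader(f, **reader_task_kwargs)
        # Forward every decorator kwarg to the underlying task wrapper.
        self._reader = toy_task(fn, **reader_task_kwargs)
        return self._reader


dataset = ToyDataset()


# Placeholder config values; a real setup would pass a Spark task config here.
@dataset.reader(task_config={"spark.executor.instances": "2"}, retries=3)
def read_rows() -> list:
    return [{"country": "USA", "age": 30}]
```

After decoration, `read_rows.task_kwargs` holds the forwarded keyword arguments, which is the piece a real task framework would use to decide how to launch the job.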
e
🤯 🤯 🤯
Alright I am going to use it next week 🙂
🦜 1
@Niels Bantilan, say I want to run a model a few different times with different feature sets and possibly filtered rows (like subsetting to people in the USA). • For running variations, I believe I can just wrap these in Flyte tasks • I see how to change the features, but not how to subset rows through the Dataset class. Do you have any recommendations for the second one?
n
You can vary both the features and the rows in Dataset.reader. The dataset reader should output your dataset, and you can parameterize this function however you wish. Here’s a basic example:
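import pandas as pd

@dataset.reader
def reader(feature_set: str, row_filter: str) -> pd.DataFrame:
    data = ...  # get data using a SQL query, or whatever

    # choose columns based on the requested feature set
    if feature_set == "feature_set_1":
        selected_data = ...
    elif feature_set == "feature_set_2":
        selected_data = ...
    else:
        ...  # etc

    # subset the rows of selected_data based on the requested filter
    if row_filter == "something":
        filtered_data = ...
    else:
        ...  # etc

    # return the filtered rows, not the unfiltered selection
    return filtered_data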
🙏 1
🎉 1
You can then pass these parameters to model.train, where the **reader_kwargs are forwarded to the reader function.
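To make the flow concrete, here is a small library-free sketch of a parameterized reader plus a train entry point that forwards `**reader_kwargs` to it. Everything in it is an assumption for illustration: the in-memory rows, the `FEATURE_SETS` mapping, the `"usa_only"` / `"all"` filter names, and the toy `train` function are hypothetical and do not reproduce UnionML's actual implementation.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]

# Hypothetical in-memory rows standing in for a SQL query result.
RAW_ROWS: List[Row] = [
    {"country": "USA", "age": 34, "income": 72000, "label": 1},
    {"country": "CAN", "age": 51, "income": 65000, "label": 0},
    {"country": "USA", "age": 27, "income": 48000, "label": 0},
]

# Hypothetical feature-set names mapped to the columns they select.
FEATURE_SETS: Dict[str, List[str]] = {
    "feature_set_1": ["age", "label"],
    "feature_set_2": ["age", "income", "label"],
}


def reader(feature_set: str, row_filter: str) -> List[Row]:
    """Filter rows first, then keep only the columns of the chosen feature set."""
    if row_filter == "usa_only":
        rows = [r for r in RAW_ROWS if r["country"] == "USA"]
    elif row_filter == "all":
        rows = list(RAW_ROWS)
    else:
        raise ValueError(f"unknown row_filter: {row_filter!r}")

    if feature_set not in FEATURE_SETS:
        raise ValueError(f"unknown feature_set: {feature_set!r}")
    columns = FEATURE_SETS[feature_set]
    return [{c: r[c] for c in columns} for r in rows]


def train(hyperparameters: Dict[str, Any], **reader_kwargs: Any) -> Dict[str, Any]:
    """Toy train entry point: forwards **reader_kwargs to the reader unchanged."""
    data = reader(**reader_kwargs)
    return {
        "n_rows": len(data),
        "columns": sorted(data[0]) if data else [],
        "hyperparameters": hyperparameters,
    }


# Each variation is just a different set of reader keyword arguments.
result = train({"num_leaves": 31}, feature_set="feature_set_2", row_filter="usa_only")
```

Running several variations then amounts to calling `train` with different `feature_set` and `row_filter` values, which is the same shape as passing reader parameters through model.train.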
e
So helpful!
n
Actually, would you mind creating an issue to document this use case? https://github.com/unionai-oss/unionml/issues/new Would love to capture it and add a page about this in the docs
e
Yeah of course!
👍 1
n
Come to think of it, the Dataset.loader decorator might be a better place to vary datasets based on per-run parameters, but currently this isn’t supported. Working on this issue to address this tho
e
I added an issue and yes I agree to the dataset loader being a great place.