astonishing-eve-54331
11/09/2024, 4:46 PM
You can define yaml configurations and programmatically execute workflows from them.
You do not need a rigorous "config dataclass for each step" - Flyte allows you to access attributes of the dataclasses in the Flyte workflow DSL.
Otherwise, if you want to also programmatically construct / define a workflow from the configs (in addition to programmatically executing them from configs), I think you would be best served by exploring how to integrate Hydra + Pydantic with Flyte's "Imperative Workflows".
My tutorial above should also play nicely with imperative workflows, but this will take a little bit more work to set up, as you will need to "bind" your defined tasks to your configurations in some way. I have also been thinking about an automated way to do this with some metaprogramming but it can be a little messy at times.
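To make that first approach concrete, here is a minimal sketch (the config fields and task are made up for illustration, and this assumes a flytekit version with native dataclass support and attribute access in the workflow DSL):
```python
from dataclasses import dataclass

from flytekit import task, workflow


@dataclass
class TrainConfig:
    # illustrative fields; on older flytekit you may need dataclasses_json /
    # mashumaro's DataClassJSONMixin for dataclass transport
    learning_rate: float = 0.1
    n_estimators: int = 100


@task
def train_model(learning_rate: float, n_estimators: int) -> float:
    # placeholder "training" that just returns a number
    return learning_rate * n_estimators


@workflow
def training_wf(cfg: TrainConfig) -> float:
    # attribute access on the dataclass directly in the workflow DSL
    return train_model(learning_rate=cfg.learning_rate, n_estimators=cfg.n_estimators)
```
Hydra / Pydantic can then populate a TrainConfig-like object from yaml and you execute training_wf with it programmatically.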
Does this make sense for your use case?
astonishing-eve-54331
11/10/2024, 4:47 PM
With imperative workflows you can check which config type you were given (XGBoostConfig vs SVMConfig) in order to select the appropriate task. This will programmatically construct a new workflow on the fly locally from your configurations. Perhaps with some clever abstractions you could automate the mapping between configurations and tasks, as well as creating unique workflow names per unique DAG structure.
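A rough sketch of that on-the-fly construction, using Flyte's imperative Workflow API (the config classes, tasks, and naming scheme are hypothetical):
```python
from dataclasses import dataclass

from flytekit import Workflow, task


@dataclass
class XGBoostConfig:
    n_estimators: int = 100


@dataclass
class SVMConfig:
    c: float = 1.0


@task
def train_xgboost(cfg: XGBoostConfig, data_path: str) -> float:
    return 0.0  # placeholder: fit XGBoost here and return a metric


@task
def train_svm(cfg: SVMConfig, data_path: str) -> float:
    return 0.0  # placeholder: fit an SVM here and return a metric


def build_training_workflow(cfg) -> Workflow:
    # pick the task from the config type, and give the workflow a name
    # that is unique per DAG structure
    train_task = train_xgboost if isinstance(cfg, XGBoostConfig) else train_svm
    wf = Workflow(name=f"train_wf.{type(cfg).__name__.lower()}")
    wf.add_workflow_input("cfg", type(cfg))
    wf.add_workflow_input("data_path", str)
    node = wf.add_entity(train_task, cfg=wf.inputs["cfg"], data_path=wf.inputs["data_path"])
    wf.add_workflow_output("metric", node.outputs["o0"])
    return wf
```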
In the future, I believe that your needs would be better suited by "Eager workflows". Eager workflows will be more literal, ergonomic, and extensible. These are still a work in progress, however (so I would not recommend them just yet), but my brilliant colleagues are investing resources into them over the next several months @wide-vegetable-51116 @flaky-parrot-42438
astonishing-eve-54331
11/17/2024, 11:26 AM
1. If you are using a list[str] or list[MyDataSource] to represent your data sources, I would highly recommend creating a small task to sort the items in the list. You want to ensure that downstream users don't unintentionally create unnecessary "cache misses" of the transform task by simply changing the order of data sources.
2. It may not be applicable or feasible in your case, but if the data sources you are merging in the transform task are already sorted row-wise, you might be able to simply concatenate instead of join / merge. This would likely be much faster and require less memory.
3. For the configurations to train_test_transform, I would highly recommend discretizing the options, i.e. only allow users to select 80/10/10, 70/15/15, 60/20/20, or other such discrete strategies (a small enum sketch follows this list). You don't really want to give each user the ability to define continuous train-test split hyperparameters, because this could result in effectively useless cache misses. Also, you should ensure that this operation is idempotent / reproducible.
4. "one type of task takes import path to model" ... if you haven't already, I would highly recommend checking out this neat functionality from hydra that will convert the path to model and load that in for you. But right, you will need to pass in the name of the model path and instantiate it inside of the task, whereas with my previous recommendation Hydra would try to instantiate the model in your local environment (which could work but might result in weird behavior).
5. You might run into some issues around Pydantic's support for type unions and Flyte's requirement around strict typing. In other words: a config of type XGBoostConfig | RandomForestConfig might throw an error. I would instead recommend a "parent" dataclass that contains all of the possible child dataclasses, each of which is optional; you would then override the None with the config for your specific model type (a sketch of this pattern follows this list).
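For point 3, the discrete options could be as simple as a string-valued enum (the names and ratios here are just illustrative), which Flyte can also accept as a task / workflow input:
```python
from enum import Enum


class SplitStrategy(Enum):
    # fixed menu of splits so identical configs always produce cache hits
    S80_10_10 = "80/10/10"
    S70_15_15 = "70/15/15"
    S60_20_20 = "60/20/20"
```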
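For point 4, something along the lines of hydra.utils.get_class (or hydra.utils.instantiate) resolves a dotted import path into the class for you; a tiny sketch with a hypothetical model path:
```python
from hydra.utils import get_class

# resolve the import path carried in your config into the actual class,
# then instantiate it inside the task
model_cls = get_class("xgboost.XGBClassifier")  # hypothetical path from the config
model = model_cls(n_estimators=100)
```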
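For point 5, the "parent dataclass with optional children" pattern looks roughly like this (field names are illustrative):
```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class XGBoostConfig:
    n_estimators: int = 100


@dataclass
class RandomForestConfig:
    max_depth: int = 8


@dataclass
class ModelConfig:
    # exactly one of these should be set; Flyte handles Optional fields,
    # whereas a raw XGBoostConfig | RandomForestConfig union may not type-check
    xgboost: Optional[XGBoostConfig] = None
    random_forest: Optional[RandomForestConfig] = None
```
Your Hydra override then only fills in the child that matches the selected model type.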
astonishing-eve-54331
11/17/2024, 5:55 PM
On train_test_split from sklearn: executing this in its own separate task means that you are doubling the memory requirements and doubling the disk space of the data that you are backing to blob storage. This is pretty inefficient.
If you instead use the aforementioned method, you can lazily evaluate your splits, both for larger-than-memory observation streaming (torch IterableDataset or TF datasets) and for in-memory data loading during model training.
For example, you could use a Polars LazyFrame to load in your unsplit data, and then evaluate a filter that hashes each observation’s primary key. Because this is lazily evaluated, you only have to actually load your training / validation data for model training, and you only have to load your testing data during OOS evaluation. You get all the same functionality of train_test_split without having to duplicate your data or load in more data than necessary at any point in time. You also don’t need to create “indices” for which observation belongs to each stratum, because that information is available given the hash of the primary key. It also guarantees reproducibility.
This same method can be used for iterable observation filtering for your DL models.
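A minimal sketch of the hash-based split, assuming a parquet source and an observation_id primary key (path, column name, and bucket boundaries are made up for illustration):
```python
import polars as pl

# LazyFrame over the unsplit data; nothing is loaded into memory yet
lf = pl.scan_parquet("s3://my-bucket/dataset.parquet")

# deterministically bucket each observation by hashing its primary key
# (note: pin your polars version if you rely on this hash for long-term reproducibility)
bucket = pl.col("observation_id").hash(seed=42) % 100

train_lf = lf.filter(bucket < 80)                    # ~80% train
val_lf = lf.filter((bucket >= 80) & (bucket < 90))   # ~10% validation
test_lf = lf.filter(bucket >= 90)                    # ~10% test

# each split is only materialized when actually needed, e.g. the test
# split is only read at OOS-evaluation time
test_df = test_lf.collect()
```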