# ask-the-community
c
Hi all. Hope this is the best channel for this question. I have a project consisting of multiple "steps", each of which must be executed from a `.py` file (e.g. `download_data.py`, `preprocess_data.py`, `train_model.py`, `eval_model.py`, etc). Currently I have wrangled these scattered `.py` scripts into somewhat of a workflow using a Makefile, such that each step in the pipeline can be executed through a `make` command (e.g. `make download_data`, `make preprocess_data`, etc). The target of each Makefile step calls a `.sh` shell script that executes the `.py` file for that step. The command `make run_entire_pipeline` calls each of the ~7 steps in sequence, as a rudimentary (linear) DAG. Obviously this rough pipeline misses a few benefits, such as caching earlier steps so they do not need to be executed if they've already been performed (e.g. no need to re-download data on a subsequent model training run if the data was already downloaded on an earlier run and has not changed since). What is the best way to migrate this Make-based workflow into a Flyte-based workflow? Specifically, is there a way to map each `.py` script to a `@task` when building a `@workflow` pipeline in Flyte? I learned about the Flyte "Script mode", and it sounds somewhat akin to what I'm trying to do, but I'm totally new to Flyte. Thanks for any help and direction. I'm working with very large digital pathology whole slide image (WSI) files, BTW. Does Flyte support inputs of the WSI variety? I.e. `.mrxs`, `.tiff`, `.czi`, `.jpeg`, `.png`, etc?
đź‘‹ 4
k
Hi @Chris Poptic, firstly welcome to the community. I will answer your questions unless someone else beats me to it.
🙏 1
> Obviously this rough pipeline misses a few benefits such as caching earlier steps such that they do not need to be executed if they've already been performed (e.g. no need to re-download data on a subsequent model training pipeline run if the data has already been downloaded on an earlier run of the pipeline and if there have been no changes in that data).
The benefits you get are failure tolerance, distributed execution, caching, and isolation. Today (with the Makefile) you do not get the benefit of reusing data that has already been downloaded.
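For context, this is what task-level caching looks like in flytekit (a minimal sketch; the task name, input, and cache version here are illustrative, not from the thread):

```python
from flytekit import task
from flytekit.types.file import FlyteFile

# With cache=True, a re-run with the same inputs returns the stored
# output instead of re-downloading; bump cache_version to invalidate.
@task(cache=True, cache_version="1.0")
def download_data(url: str) -> FlyteFile:
    local_path = "/tmp/raw_data.tar"
    # ... fetch `url` to local_path ...
    return FlyteFile(local_path)
```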
> What is the best way to migrate this Make-based workflow into a Flyte-based workflow? Specifically is there a way to map each `.py` script to a `@task` when building a `@workflow` pipeline in Flyte? I learned about the Flyte "Script mode", and it sounds somewhat akin to what I'm trying to do, but I'm totally new to Flyte. Thanks for any help and direction.
There are 2 ways:
1. Use the `ShellTask` to model what you have today, with a little more data passing, and thus model it as a Flyte workflow.
2. Or update your scripts to have a `task` function each:
```python
from flytekit import task

@task
def foo(raw_dir: str) -> str:
    ...  # the script's module-level logic moves into the task body
```
3. You can also mix and match.

The workflow can also be constructed either using the imperative model or using the `@workflow` syntax/DSL. Note: you can of course mix and match and slowly migrate if you want. Ideally migrate to the `@task` syntax, as your steps are already Python.
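A rough sketch of option 1, with one Makefile step wrapped in a `ShellTask` and chained inside a `@workflow` (the script name, inputs, and output location are illustrative, not from the thread):

```python
from flytekit import kwtypes, workflow
from flytekit.extras.tasks.shell import OutputLocation, ShellTask
from flytekit.types.file import FlyteFile

# Wraps the existing download step; Flyte interpolates {inputs.*}/{outputs.*}.
download_data = ShellTask(
    name="download_data",
    script="""
    set -ex
    bash download_data.sh {inputs.url} {outputs.raw}
    """,
    inputs=kwtypes(url=str),
    output_locs=[OutputLocation(var="raw", var_type=FlyteFile, location="raw_data.tar")],
)

@workflow
def pipeline(url: str) -> FlyteFile:
    # Calling tasks in sequence gives the same linear DAG as `make run_entire_pipeline`.
    return download_data(url=url)
```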
> I'm working with very large digital pathology whole slide image (WSI) files, BTW. Does Flyte support inputs of the WSI variety? I.e. `.mrxs`, `.tiff`, `.czi`, `.jpeg`, `.png`, etc?
Any type of file can be handled using `FlyteFile`. It will automatically upload and download files to S3/GCS, etc. Example: Working With Files
🙏 1
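To make that concrete, a hedged sketch of a WSI step using `FlyteFile` (the tiling logic and paths are placeholders):

```python
from flytekit import task
from flytekit.types.file import FlyteFile

@task
def tile_slide(wsi: FlyteFile) -> FlyteFile:
    local_path = wsi.download()  # Flyte pulls the .mrxs/.tiff/.czi/... locally
    out = "/tmp/tiles.tiff"
    # ... read local_path (e.g. with openslide) and write tiles to `out` ...
    return FlyteFile(out)  # uploaded back to blob storage automatically
```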
c
Thanks @Ketan (kumare3). This is awesome. Am I correct that we can't simply drop the shell script into the `ShellTask` object? From the `ShellTask` docs it looks like the user has to refactor the shell script to explicitly specify the script's inputs and outputs, specifically using the syntax `{inputs.input_name}` and `{outputs.output_name}`. Obviously we could do this manually for each input and output. But what if you're passing in an entire dictionary of config params (using something like Hydra's `DictConfig` object)? Could you simply pass in all those numerous hyperparameters using `{inputs.myHydraConfigObject}` rather than spelling each one out like `{inputs.hydra_object.learning_rate}`, `{inputs.hydra_object.num_epochs}`, etc?
k
It's a shell task, so config reading would be hard. If you want to use Hydra, prefer using `@task`. Also cc @Fabio Grätz, who has done some fantastic work with Hydra and Flyte.
🙏 1
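A hedged sketch of that route: flatten the Hydra `DictConfig` into a plain dict at launch time and pass it to the `@task` (the function and field names here are made up):

```python
from flytekit import task

@task
def train_model(cfg: dict) -> float:
    # Fields are plain dict lookups; no {inputs.x.y} templating needed.
    lr, epochs = cfg["learning_rate"], cfg["num_epochs"]
    val_loss = 0.0  # ... training loop would go here ...
    return val_loss

# Launch side (hypothetical): cfg_dict = OmegaConf.to_container(hydra_cfg, resolve=True)
# then call train_model(cfg=cfg_dict) inside a workflow.
```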
f
Hey, this lightning talk is the best summary of our flyte/hydra integration:

https://www.youtube.com/watch?v=tghvVvHJi7s&t=216s&ab_channel=Flyte

Due to capacity reasons we didn't open source it yet but plan to, so please let me know in case you are interested in this.
🙏 1
đź‘Ť 2
s
cc: @Shivay Lamba
c
Hi @Fabio Grätz, thanks, this is a great video. I'd definitely be interested if you open-sourced this. I think there's a lot of value in integrating `hydra-core` with `flyte`.
k
@Chris Poptic, some of the work done in the Hydra plugin is already available as part of `pyflyte run`
🙏 1
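For reference, a minimal `pyflyte run` invocation (the file, workflow, and input names are examples, not from the thread):

```bash
pyflyte run --remote pipeline.py pipeline --url https://example.com/raw_data.tar
```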