# ask-the-community
c
Hi all. Hope this is the best channel for this question. I have a project consisting of multiple "steps", each of which must be executed from a `.py` file (e.g. `download_data.py`, `preprocess_data.py`, `train_model.py`, `eval_model.py`, etc). Currently I have wrangled these scattered `.py` scripts into somewhat of a workflow using a Makefile, such that each step in the pipeline can be executed through a `make` command (e.g. `make download_data`, `make preprocess_data`, etc). The target of each Makefile step calls a `.sh` shell script that executes the `.py` file for that step. The command `make run_entire_pipeline` calls each of the ~7 steps in sequence, as a rudimentary (linear) DAG. Obviously this rough pipeline misses a few benefits, such as caching earlier steps so they do not need to be executed if they've already been performed (e.g. no need to re-download data on a subsequent model training run if the data was already downloaded on an earlier run and has not changed since). What is the best way to migrate this Make-based workflow into a Flyte-based workflow? Specifically, is there a way to map each `.py` script to a `@task` when building a `@workflow` pipeline in Flyte? I learned about the Flyte "Script mode", and it sounds somewhat akin to what I'm trying to do, but I'm totally new to Flyte. Thanks for any help and direction. I'm working with very large digital pathology whole slide image (WSI) files, BTW. Does Flyte support inputs of the WSI variety? I.e. `.mrxs`, `.tiff`, `.czi`, `.jpeg`, `.png`, etc?
đź‘‹ 4
k
Hi @Chris Poptic, firstly welcome to the community. I will answer your questions unless someone else beats me to it.
🙏 1
> Obviously this rough pipeline misses a few benefits such as caching earlier steps such that they do not need to be executed if they've already been performed (e.g. no need to re-download data on a subsequent model training pipeline run if the data has already been downloaded on an earlier run of the pipeline and if there have been no changes in that data).
The benefits you get are failure tolerance, distributed execution, caching, and isolation. Today (with the Makefile) you do not get the benefit of reusing data that has already been downloaded.
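For context, this is what task-level caching looks like in flytekit (a minimal sketch; the task name, input, and cache version here are illustrative, not from the thread):

```python
from flytekit import task
from flytekit.types.file import FlyteFile

# With cache=True, a re-run with the same inputs returns the stored
# output instead of re-downloading; bump cache_version to invalidate.
@task(cache=True, cache_version="1.0")
def download_data(url: str) -> FlyteFile:
    local_path = "/tmp/raw_data.tar"
    # ... fetch `url` to local_path ...
    return FlyteFile(local_path)
```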
> What is the best way to migrate this Make-based workflow into a Flyte-based workflow? Specifically is there a way to map each `.py` script to a `@task` when building a `@workflow` pipeline in Flyte? I learned about the Flyte "Script mode", and it sounds somewhat akin to what I'm trying to do, but I'm totally new to Flyte. Thanks for any help and direction.
There are 2 ways:
1. Use the `ShellTask` to model what you have today, with a little more data passing, and thus model it as a Flyte workflow.
2. Or update your scripts to have a `task` function each:
```python
from flytekit import task

@task
def foo(raw_dir: str) -> str:
    ...  # the script's module-level logic moves into the task body
```
3. You can also mix and match.

The workflow can also be constructed either using the imperative model or using the `@workflow` syntax/DSL. Note: you can of course mix and match and slowly migrate if you want. Ideally migrate to the `@task` syntax, as your steps are already Python.
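A rough sketch of option 1, with one Makefile step wrapped in a `ShellTask` and chained inside a `@workflow` (the script name, inputs, and output location are illustrative, not from the thread):

```python
from flytekit import kwtypes, workflow
from flytekit.extras.tasks.shell import OutputLocation, ShellTask
from flytekit.types.file import FlyteFile

# Wraps the existing download step; Flyte interpolates {inputs.*}/{outputs.*}.
download_data = ShellTask(
    name="download_data",
    script="""
    set -ex
    bash download_data.sh {inputs.url} {outputs.raw}
    """,
    inputs=kwtypes(url=str),
    output_locs=[OutputLocation(var="raw", var_type=FlyteFile, location="raw_data.tar")],
)

@workflow
def pipeline(url: str) -> FlyteFile:
    # Calling tasks in sequence gives the same linear DAG as `make run_entire_pipeline`.
    return download_data(url=url)
```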
> I'm working with very large digital pathology whole slide image (WSI) files, BTW. Does Flyte support inputs of the WSI variety? I.e. `.mrxs`, `.tiff`, `.czi`, `.jpeg`, `.png`, etc?
Any type of file can be handled using `FlyteFile`. It will automatically upload and download files to S3/GCS, etc. Example: Working With Files
🙏 1
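To make that concrete, a hedged sketch of a WSI step using `FlyteFile` (the tiling logic and paths are placeholders):

```python
from flytekit import task
from flytekit.types.file import FlyteFile

@task
def tile_slide(wsi: FlyteFile) -> FlyteFile:
    local_path = wsi.download()  # Flyte pulls the .mrxs/.tiff/.czi/... locally
    out = "/tmp/tiles.tiff"
    # ... read local_path (e.g. with openslide) and write tiles to `out` ...
    return FlyteFile(out)  # uploaded back to blob storage automatically
```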
c
Thanks @Ketan (kumare3). This is awesome. Am I correct that we can't simply drop the shell script into the `ShellTask` object? From the `ShellTask` docs it looks like the user has to refactor the shell script to explicitly specify the script's inputs and outputs, specifically using the syntax `{inputs.input_name}` and `{outputs.output_name}`. Obviously we could do this manually for each input and output. But what if you're passing in an entire dictionary of config params (using something like Hydra's `DictConfig` object)? Could you simply pass in all those numerous hyperparameters using `{inputs.myHydraConfigObject}` rather than spelling each one out like `{inputs.hydra_object.learning_rate}`, `{inputs.hydra_object.num_epochs}`, etc?
k
It's a shell task, so config reading would be hard. If you want to use Hydra, prefer using `@task`. Also cc @Fabio Grätz, who has done some fantastic work with Hydra and Flyte.
🙏 1
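A hedged sketch of that route: flatten the Hydra `DictConfig` into a plain dict at launch time and pass it to the `@task` (the function and field names here are made up):

```python
from flytekit import task

@task
def train_model(cfg: dict) -> float:
    # Fields are plain dict lookups; no {inputs.x.y} templating needed.
    lr, epochs = cfg["learning_rate"], cfg["num_epochs"]
    val_loss = 0.0  # ... training loop would go here ...
    return val_loss

# Launch side (hypothetical): cfg_dict = OmegaConf.to_container(hydra_cfg, resolve=True)
# then call train_model(cfg=cfg_dict) inside a workflow.
```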
f
Hey, this lightning talk is the best summary of our flyte/hydra integration:

https://www.youtube.com/watch?v=tghvVvHJi7s&t=216s&ab_channel=Flyte

Due to capacity reasons we didn't open source it yet but plan to, so please let me know in case you are interested in this.
🙏 1
đź‘Ť 2
s
cc: @Shivay Lamba
c
Hi @Fabio Grätz, thanks, this is a great video. I'd definitely be interested if you open-sourced this. I think there's a lot of value in integrating `hydra-core` with `flyte`.
k
@Chris Poptic, some of the work done in the Hydra plugin is already available as part of `pyflyte run`
🙏 1
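For reference, a minimal `pyflyte run` invocation (the file, workflow, and input names are examples, not from the thread):

```bash
pyflyte run --remote pipeline.py pipeline --url https://example.com/raw_data.tar
```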