Weide Zhang

03/22/2022, 12:48 AM
Hi, I’m new to Flyte, but I find this project interesting. I am building a training pipeline that uses multiple k8s clusters (one is a public AWS EKS cluster and the other is a private on-premise k8s cluster). The training dataset (petastorm format) needs to be generated on AWS EKS, stored in an S3 path, and copied over to the on-premise cluster. The on-premise cluster will then kick off distributed Horovod training that consumes the generated dataset (if the dataset already exists, i.e. it has already been synced across clusters, no copy is needed). What’s the best practice for achieving this in Flyte? How many workflows are needed?

Haytham Abuelfutuh

03/22/2022, 1:19 AM
Hey Weide, welcome to the Flyte community! Given what you described, you will need a workflow running on the EKS cluster doing the training, with the generated dataset possibly being one of its outputs. You'll need another workflow with a task that does the copying and takes a dataset as its input. You can leverage the built-in caching mechanism to avoid running that task again if the same dataset is passed in. You can optionally have a driver workflow that coordinates these two workflows if you want... Happy to answer other questions if you want to dig in more... Or if you want to sketch out the workflows, we can review that too.

Ketan (kumare3)

03/22/2022, 3:23 AM
As @Haytham Abuelfutuh said. Additionally, this will need to be 2 different clusters.
But you can access the data from S3, as flytekit will load the right driver automatically.

Weide Zhang

03/22/2022, 3:50 AM
got it. thanks