In my organization, the ETL workflows that generate the data my ML modeling workflows depend on are owned by another team.
So, the ETL workflow is a daily scheduled launch plan (LP1), and my modeling workflow is another daily scheduled launch plan (LP2).
I need the daily execution of LP2 to check for the successful completion of the same day's LP1 execution before it starts running.
In Airflow we did this via ExternalTaskSensor — is there a way to do it in Flyte?
Could you share an existing pattern to avoid re-inventing the wheel?
03/01/2023, 9:06 PM
Is the data generated by LP1 pushed to a table or an S3 path?
In the first case, you could poll the DB for modified rows / tables / new partitions, possibly using the DB metadata. In the second, checking for the existence of the path on S3 and sleeping until it appears could work.
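A minimal sketch of the poll-and-sleep idea — the loop itself is generic, and the commented boto3 snippet (bucket name, key, `head_object` check) is a hypothetical illustration, not anything specified in this thread:

```python
import time


def wait_for_upstream(exists_fn, timeout_s=3600, poll_s=60, sleep=time.sleep):
    """Poll exists_fn() until it returns True; raise if timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if exists_fn():
            return True
        if time.monotonic() >= deadline:
            raise TimeoutError("upstream data never appeared")
        sleep(poll_s)


# Inside a Flyte task, exists_fn could wrap an S3 HEAD request
# (hypothetical bucket/key; requires boto3 and credentials):
#
#   import boto3
#   s3 = boto3.client("s3")
#   def exists_fn():
#       try:
#           s3.head_object(Bucket="etl-bucket", Key="etl/2023-03-01/_SUCCESS")
#           return True
#       except s3.exceptions.ClientError:
#           return False
```

The check function is injected so the same loop works for a DB-partition query in case one or an S3 HEAD in case two.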
To fully replicate the Airflow sensor, you would need to check Flyte's metadata store for executions of a specific workflow. I know we can search for an execution by ID; I'm not sure about getting all executions for a workflow.
03/02/2023, 7:37 AM
Sounds like a plan. A lot of our community members have home-grown solutions for these kinds of use cases. In your case, since LP2 depends on LP1, you can store the data returned by the ETL workflow in a DB and check whether the data is available for a certain time period. The check can be encapsulated in a task in your modeling workflow. You could also use FlyteRemote within your modeling workflow to check the status of your ETL workflow, though I'm not sure that's the best way to go about it. In a nutshell, the launch plan will be triggered daily irrespective of the status of your ETL workflow; you need to handle the verification within your modeling workflow.
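To make the gating-task idea concrete, here's a hedged sketch: the polling loop is plain Python, and the commented-out FlyteRemote usage (project, domain, execution name, and the phase attribute) is an assumption for illustration, not a confirmed pattern:

```python
import time


def gate_on_upstream(fetch_phase, max_checks=24, wait_s=300, sleep=time.sleep):
    """Block until the upstream execution reports SUCCEEDED; fail fast on a
    terminal failure phase; give up after max_checks polls."""
    for _ in range(max_checks):
        phase = fetch_phase()
        if phase == "SUCCEEDED":
            return phase
        if phase in ("FAILED", "ABORTED", "TIMED_OUT"):
            raise RuntimeError(f"upstream execution ended in {phase}")
        sleep(wait_s)  # still queued/running (or not visible yet)
    raise TimeoutError("upstream execution did not succeed in time")


# fetch_phase could wrap FlyteRemote (hypothetical project/domain/name,
# and the phase attribute is an assumption about the returned object):
#
#   from flytekit.configuration import Config
#   from flytekit.remote import FlyteRemote
#   remote = FlyteRemote(Config.auto(), default_project="etl",
#                        default_domain="production")
#   def fetch_phase():
#       ex = remote.fetch_execution(name="lp1-exec-id")
#       return str(ex.closure.phase)
```

Running `gate_on_upstream` as the first task of LP2 gives you roughly the ExternalTaskSensor behavior: LP2 still launches daily, but its real work only starts once LP1's run for the day has succeeded.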