Hey folks! We have a use cases to trigger Flyte wo...
# flyte-support
c
Hey folks! We have a use cases to trigger Flyte workflows when some external dependency is ready, for example, a Iceberg/Parquet dataset was updated or there's a new event. Is there a way to configure Flyte's scheduler to listen to these events? From the docs, it looks like this use case not supported atm.
t
this is not currently possible in the general case. this is partially supported in the Union deployed variant of flyte but not the oss platform by itself.
the reason for this though is mostly because this is largely an orthogonal exercise… what you’re describing in the general case is a webhook interface, that can watch arbitrary files, and launch tasks & workflows as a result. the actual kicking off of a workflow is nothing more than a grpc call with the provided client.
the webhook system would need to offer a seamless way to customize the watching of files across any number of sources, handle long-lived authentication/authorization, etc.
to achieve what you want with the tools available today, I think you should look at https://docs.flyte.org/en/latest/deployment/agents/sensor.html#deployment-agent-setup-sensor
sensor tasks, if you follow them up with other tasks, are a way to respond to files and other resources landing
c
Thanks! The Sensor task seems to cover the use case. Curious, what's not supported yet?
t
i guess beyond the semantics of it… it’s not really reactive. that’s probably the best way to describe it
you, or a schedule, still have to kick off the workflow.
like it’s still a different paradigm from the fully-featured triggering/webhook system you might imagine
but again, it’s kinda hard to justify building that as a flyte thing
c
Is it fair to say that this seems very similar to Airflow? The DAG schedules the workflow and within the task users have specify the upstream task they want to depend on.
t
possibly? i mean it wasn’t copied from airflow, but if the semantics feel familiar then so much the better.
c
Fair. Would you know what will happen in this case: say a Flyte wf is scheduled to run every hour and it depends on some S3 file to be present. For whatever reason, the S3 file creation is delayed by 2 hours. 2 hours later, will there be 2 workflow instances running?
g
yup, you could create a custom sensor that watch those events. https://github.com/flyteorg/flytekit/blob/master/flytekit/sensor/file_sensor.py#L5-L14
👀 1
2 hours later, will there be 2 workflow instances running?
for now yes, but there is a RFC about execution concurrency that aims to address this issue. https://github.com/flyteorg/flyte/pull/5659/files
t
basically if you know how many files will land, or if there is a regular cadence, then you can use this approach.
c
These resources are super helpful! I have few more question
this [event based scheduling] is partially supported in the Union deployed variant of flyte but not the oss platform by itself.
Curious to understand what's the gap in Sensors
t
but if the goal is to react to an unknown number of events, then this pattern becomes a lot harder. you’d basically have to kick off a new execution at the end of each one.
Union does provide a triggering mechanism (that can handle this unknown-count case), but the triggering mechanism is through the higher order construct of an Artifact. To complete the full system you probably have in mind, you’d need a lambda function or something that can take arbitrary incoming files/tables/etc, and create an artifact instance out of them
c
Got it! Is there a public facing tracker to track this and RFC implementation?
t
well the union effort is closed source only (because of the reliance on additional infrastructure).
the amount of additional infrastructure, and the additional support burden on flyte contributors, make it a hard sell i think for such a system to be bundled in with core flyte. but over time, who knows, maybe it will make sense to develop something. currently there is not an rfc. feel free to start an rfc incubator discussion though. maybe over time ideas can be collected there.
c
Oh you are referring to this. Another tactical question, I wonder if I can wire multiple sensors. For example,
sensor(file_1_ready) && sensor(file_2_ready) >> first_task
If
file_1
&
file_2
sensors aren't true at the same time then
first_task
won't be triggered, ie, both files should land at same time and both sensors should evaluate at same time. Is my understanding right?
t
i think you can just chain them… but yeah, or
Copy code
s1 = sensor(...)
s2 = sensor(...)
s1 >> t1
s2 >> t1
check that but i think that also works
c
Interesting, I'd have read this as, if either s1 or s2 are true then trigger t1 🤔
t
both will block t1, the >> operator just creates an artificial dependency when there is no data dependency
c
Got it. Last question, why is the scheduling considered orthogonal to Flyte's use case?
t
scheduling is not… normal cron scheduling is built in
c
Ah - I see. Event based triggering is orthogonal.
Thank you for all the resources.