:wave: I have several workflows that are dependent...
# flyte-support
m
👋 I have several workflows that depend on the same external data, which appears at quasi-regular intervals several times a day. I know approximately when to expect the next chunk. What is the best way to make sure that polling is "deduplicated", i.e. that I don't query the external service too often?
Currently the pipeline runs in luigi with a separate trigger application that waits only when the external data is most likely to appear and then triggers the appropriate workflows. Obviously this can be reused with Flyte, but I was hoping that triggering can be implemented inside Flyte too. The approaches I have in mind are the following:
1. Create a special waiting task with caching, so that only one of these tasks runs at a time and, once the data appears, the other workflows retrieve that fact from the cache.
2. Create a special sensor that somehow prevents making external requests for the same data too often (maybe a singleton state class for these sensor tasks would work?).
What is the best solution for this? Perhaps I'm missing some other, simpler approach?
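A minimal sketch of the "deduplicated" polling idea behind both approaches, assuming all callers live in one Python process. `DedupedPoller`, `check`, and `chunk_id` are hypothetical names, not part of Flyte. Flyte tasks normally run in separate containers, so in a real deployment the shared state would have to live in shared storage or in Flyte's task cache rather than in memory.

```python
import threading
import time
from typing import Callable, Dict


class DedupedPoller:
    """Share one external check among many callers (hypothetical helper)."""

    def __init__(
        self,
        check: Callable[[str], bool],
        min_interval_s: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._check = check                    # the real request to the external service
        self._min_interval_s = min_interval_s  # minimum spacing between requests
        self._clock = clock
        self._lock = threading.Lock()          # only one caller polls at a time
        self._last_poll: Dict[str, float] = {}
        self._available: Dict[str, bool] = {}

    def is_available(self, chunk_id: str) -> bool:
        with self._lock:
            # Once a chunk has been seen, always answer from the cache.
            if self._available.get(chunk_id):
                return True
            now = self._clock()
            last = self._last_poll.get(chunk_id)
            # Too soon since the last request: reuse the previous negative answer.
            if last is not None and now - last < self._min_interval_s:
                return False
            self._last_poll[chunk_id] = now
            found = self._check(chunk_id)
            self._available[chunk_id] = found
            return found
```

Every workflow would call `is_available` with the same `chunk_id`; only the first call in each interval reaches the external service.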
a
• The best approach would be to push instead of poll, if you have control over it. This means your external data source has to push the new data whenever it is modified or added.
• Next would be to use a queue that does the polling and pushes to your workflow, instead of having a task that constantly polls. This queue is separate from your workflow logic.
• Last would be to run your polling task at specific intervals that fit the business case. For example, if the data must be up to date for a task every morning, you can run the polling once the night before. This is a business decision, not a technical one.
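The last option (poll only at times that fit the business case) can be sketched as a small gate that a scheduled job consults before it touches the external service. This is an illustration only: `should_poll`, the expected arrival times, and the tolerance are assumptions, and the datetimes are naive (no time zone handling).

```python
import datetime as dt
from typing import Iterable


def should_poll(
    now: dt.datetime,
    expected: Iterable[dt.time],
    tolerance: dt.timedelta,
) -> bool:
    """Return True if `now` is within `tolerance` of any expected arrival time."""
    for arrival in expected:
        # Check yesterday, today and tomorrow so windows that cross midnight work.
        for day_offset in (-1, 0, 1):
            day = now.date() + dt.timedelta(days=day_offset)
            candidate = dt.datetime.combine(day, arrival)
            if abs(now - candidate) <= tolerance:
                return True
    return False


# Hypothetical schedule: data usually lands around 06:00, 12:00 and 23:50.
EXPECTED = [dt.time(6, 0), dt.time(12, 0), dt.time(23, 50)]
print(should_poll(dt.datetime(2024, 1, 1, 6, 10), EXPECTED, dt.timedelta(minutes=20)))
```

Outside the expected windows the job simply exits without making a request.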
m
Debasis, thanks for the prompt reply!
The best approach would be to push instead of poll, if you have control over it.
I like this approach but sadly I have no control of external data source...
Next would be to use a queue which does the polling and pushes it to your workflow, instead of having a task that constantly polls.
Is it a separate service, something like the "separate trigger application" that I mentioned? Do I understand correctly that pushing to a workflow means executing the workflow? And what about a custom sensor? Is it a bad idea and/or not a feasible solution?
a
The second approach is a pub/sub mechanism in which the workflow subscribes to your queue. Custom code checks the data source for modifications, and when a new message or new data arrives, the workflow is triggered. From your message, you already have this, so you can move that task to Flyte. The polling frequency depends on what is acceptable to the business. Since you have no control over when the data source is updated, the only option is to poll, whether through an external application (custom code that polls and pushes the changes to trigger a workflow) or through a Flyte task that polls and triggers a workflow. Bringing the application into Flyte as a task should be discussed with your architect, since ideally you would keep a separation of concerns, but it can be done with a Flyte task.
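A rough sketch of the poll-and-trigger pattern described here: one poller makes a single external request per cycle and fans out to every dependent workflow only when a new chunk appears. `PollAndTrigger`, `fetch_latest_id`, and `trigger` are hypothetical; in practice `trigger` might start a Flyte execution (for example through flytekit's `FlyteRemote` client) or publish a message to a queue.

```python
from typing import Callable, Optional, Sequence, Set


class PollAndTrigger:
    """Poll once per cycle and trigger each workflow once per new chunk."""

    def __init__(
        self,
        fetch_latest_id: Callable[[], Optional[str]],
        workflows: Sequence[str],
        trigger: Callable[[str, str], None],
    ) -> None:
        self._fetch_latest_id = fetch_latest_id  # the only call to the external source
        self._workflows = list(workflows)
        self._trigger = trigger                  # hypothetical hook that starts a workflow
        self._seen: Set[str] = set()             # chunks that were already dispatched

    def run_once(self) -> int:
        """Poll once; return how many workflow executions were requested."""
        chunk_id = self._fetch_latest_id()
        if chunk_id is None or chunk_id in self._seen:
            return 0  # nothing new, so no workflow is started
        self._seen.add(chunk_id)
        for name in self._workflows:
            self._trigger(name, chunk_id)
        return len(self._workflows)
```

Because the `_seen` set lives in memory, a restart would forget which chunks were already dispatched; a persistent store would be needed for that.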
m
where the workflow can subscribe to your queue
Is the subscription mechanism provided by Flyte? And once again, sorry for repeating the question: what about a custom sensor? Is it a bad idea and/or not a feasible solution? It would still be polling, but done in a different way.
a
There is no problem with custom sensors. You can use a custom sensor that checks for data modification and triggers the workflow. The right approach depends on the polling frequency and on the kind of data source we are talking about. You have to make sure that you don't constantly query your data source and load it unnecessarily. The polling frequency will depend on the business logic, and you can keep it configurable, so that you have the flexibility to increase or decrease it when needed.
All three approaches mentioned above are valid; which one you choose depends on the business.
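To illustrate a sensor with a configurable polling frequency, here is a minimal wait loop. It is a sketch, not Flyte's sensor API: `wait_for_data`, `check`, and the parameter names are assumptions, and `sleep` and `clock` are injectable only to make the loop easy to test.

```python
import time
from typing import Callable


def wait_for_data(
    check: Callable[[], bool],
    poll_interval_s: float,
    timeout_s: float,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Call `check` every `poll_interval_s` seconds until it succeeds or time runs out."""
    deadline = clock() + timeout_s
    while True:
        if check():
            return True   # data is there; the caller can start the workflow
        if clock() + poll_interval_s > deadline:
            return False  # the next poll would land after the deadline
        sleep(poll_interval_s)
```

Raising `poll_interval_s` reduces the load on the data source; lowering it shortens the delay before new data is noticed.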
m
You mentioned that a workflow can subscribe to pub/sub. Is that supported by Flyte directly, or do I just need to trigger the workflow manually somewhere, e.g. in a lambda?
a
Yes, a lambda will do. Below is a discussion thread that might be helpful: https://discuss.flyte.org/t/16185988/hello-quick-question-what-s-best-pattern-of-triggering-flyte
m
Thanks, Debasis. I must admit that the best approach for my specific case is still not clear to me, but I am beginning to understand the whole picture better.