Looking for some advice for integrating a task into a Flyte workflow. This task that is currently initiated by cron. This task scales horizontally with the number of messages in an SQS queue. I am currently using KEDA and AWS ASG of spot instances to achieve this. There can be > 10k messages in the queue when the task is started. This task does ~30 seconds setup when started. I was initially looking at using map_task to run this, but given the number of messages and duration of setup, I am thinking this is not a good fit. Thoughts? Thanks!
10/02/2023, 6:26 PM
interesting, so if its a cron job does it trigger >10k message at a given time? i think map_task is probably better than dynamic in that case.
also how compute intensive are they, if its light is can it be just done inside a single task or batch them and process multiple messages in a single task
10/02/2023, 6:48 PM
A separate system is continuously populating the queue, but the cron job that kicks off the task runs hourly. Each message represents multiple "units of work" (essentially pointers to objects in s3). The message is packed with "units of work" up to the 256k message size limit. The workload is very cpu intensive.
10/02/2023, 7:56 PM
i see, in that case i think map task should work well over dynamic task. @Yee do you have any insight? is map_task faster to setup?
10/02/2023, 9:44 PM
i don’t think i’ve seen this use-case before, but what’s the concern with map task? the start-up duration doesn’t matter that much right? you can just spin up like 5-10 tasks to read the 10k right?
i think the issues that can arise are more nuanced in this case - what’s the stopping criteria? will new messages hit the queue while you’re processing them, are there any concerns around concurrency of multiple messages being processed at the same time, etc
what’s the use case btw?
10/02/2023, 10:44 PM
@Jon I would implement a Flyte Agent or a python task, that is up reading from a queue and working on it
there would be no KEDA, in this case, as Flyte does not integrate with it today. We plan to in the future
FYI @Haytham Abuelfutuh / @Dan Rammer (hamersaw) (no comments needed, just FYI)
10/03/2023, 2:00 AM
@Yee, I should have been more specific the external system is producing a stream of data, but writing to the queue in bulk at the start of each hour. The use case is can be thought of as stream processing. The consumers of our data can tolerate ~1 hour of delay before data is available to them. The setup described above ends up being very reliable and cost-effective. I appreciate the responses!
@Ketan (kumare3), thanks! Flyte Agents look interesting and would do most of what I am looking for. Also good to know about potential integration with KEDA in the future.
10/05/2023, 4:46 PM
Late to the party, I think what you, @Jon, have built is just fine! You are using KEDA to scale up based on pod events, right? an integration with flyte isn't strictly required but of course it can speed up scaling the cluster so it's ready for when the pods land.
Are there concerns about partial failures? What if a pod starts, reads a message, processes a few files before some catastrophic event happens (e.g. the spot machine disappears or the pod crashes)? Are the operations that pod perform idempotent (ideally)?