Good morning, I have been looking for a way to run...
# flyte-support
w
Good morning, I have been looking for a way to run backfill jobs. Typically we run daily batch jobs, which work well with a Launch Plan. But being able to process old datasets, each identified by a date, is important (schemas change, things fail, etc.). I have been trying to generate lists of dates to process (something like `generate_dates(start_date, num_days_back) -> List[str]`). If I hardcode a for loop (`for i in range(10)`) and create dates that way with no particular inputs, the workflow executes correctly, but it would be really nice to be able to take start date and number of days back as an input to the workflow (the blocker is that we cannot iterate over a Promise). I see some discussion on backfilling on Slack, so I am probably just missing something obvious. Is this already baked into LaunchPlan? Would love to hear how others have solved this before 🙂 This is the last major hurdle before we are ready to put Flyte into prod - it’s such an improvement over Luigi.
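For reference, a minimal plain-Python sketch of the `generate_dates` helper described in the message. It assumes ISO-formatted date strings and that `num_days_back` counts the dates returned, starting with `start_date` itself; both are assumptions, since the message only gives the signature. It has no Flyte dependency, so inside a workflow it would still have to run where the inputs are concrete values rather than Promises.

```python
from datetime import date, timedelta
from typing import List


def generate_dates(start_date: str, num_days_back: int) -> List[str]:
    """Return num_days_back ISO dates, newest first, starting at start_date.

    Assumption: start_date is an ISO string such as "2022-03-02".
    """
    if num_days_back < 0:
        raise ValueError("num_days_back must be non-negative")
    start = date.fromisoformat(start_date)
    # Step back one day at a time; timedelta handles month boundaries.
    return [(start - timedelta(days=i)).isoformat() for i in range(num_days_back)]


if __name__ == "__main__":
    for day in generate_dates("2022-03-02", 3):
        print(day)
```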
t
@wooden-sandwich-59360, good to know that you found Flyte helpful. We don’t have a backfill feature yet. It is on our roadmap and will be implemented soon. cc @freezing-airport-6809
👍 3
f
We were just talking about it
There is a very cheap way to do backfill, cc @thankful-minister-83577
@wooden-sandwich-59360 would you be open to trying a simple way to backfill and then we can do a more systematic integration
Cc @most-gold-65483 - we will share a quick trick - and would love your feedback
t
but it would be really nice to be able to take start date and number of days back as an input to the workflow (the blocker is that we cannot iterate over a Promise).
You can use a dynamic workflow to perform iterations over workflow inputs. (thanks @hallowed-mouse-14616)
f
@tall-lock-23197 a better option would be to use an imperative workflow to construct a static backfilling workflow that can be tracked and generated
w
@freezing-airport-6809 We can give it a go, what did you have in mind?
f
Sure. It’s a day off today in the US, so we will send a prototype.
@wooden-sandwich-59360 here is what I wrote up in the last 20 minutes, so please pardon boundary conditions etc. The basic idea is to generate a backfiller workflow on demand and register it using FlyteRemote. The workflow can be generated using any launchplan that has a schedule. We will probably add this to pyflyte, or you can contribute. The code is in this gist: https://gist.github.com/kumare3/718c76cedce0f75af5321d1c23e68d6c On generating, I was able to successfully register dummy_wf.
@thankful-minister-83577 / @broad-monitor-993 / @most-gold-65483 please TAL
cc @steep-jackal-21573
b
Hi @wooden-sandwich-59360 just wanted to clarify something about this:
but it would be really nice to be able to take start date and number of days back as an input to the workflow
If I understand correctly, is this what you want to achieve at a high level?
Assumptions:
• I have a `process_dataset` routine that processes a single dataset for a particular date
• Every day I process today’s dataset, in addition to datasets for some number of days back.
• Each invocation of `process_dataset` for a particular day’s dataset is completely independent from processing other day’s datasets.
Steps: Every day, kick off scheduled job at `kickoff_datetime`:
1. Pass in `kickoff_datetime` into main launchplan entrypoint, in addition to a `num_days_back` parameter
2. Generate a list of dates, which includes today’s date and some `num_days_back` worth of days
3. Apply `process_dataset` to each dataset in an embarrassingly parallel fashion
Does this seem correct?
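The three steps above can be sketched in plain Python, outside Flyte. This is only an illustration of the shape of the computation: `process_dataset` is a hypothetical stand-in for the real routine, `dates_to_process` and `run` are made-up names, and the thread pool with `max_workers` is a local analogy for independent parallel runs, not a Flyte setting.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime, timedelta
from typing import Dict, List


def process_dataset(day: str) -> str:
    """Hypothetical stand-in for the real per-date processing routine."""
    return f"processed {day}"


def dates_to_process(kickoff_datetime: datetime, num_days_back: int) -> List[str]:
    """Step 2: today's date plus num_days_back earlier days, newest first."""
    today = kickoff_datetime.date()
    return [(today - timedelta(days=i)).isoformat() for i in range(num_days_back + 1)]


def run(kickoff_datetime: datetime, num_days_back: int, max_workers: int = 4) -> Dict[str, str]:
    """Step 3: apply process_dataset to each date independently."""
    days = dates_to_process(kickoff_datetime, num_days_back)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with days.
        results = list(pool.map(process_dataset, days))
    return dict(zip(days, results))


if __name__ == "__main__":
    # Step 1: the kickoff time and look-back depth arrive as inputs.
    for day, status in run(datetime(2022, 1, 3, 6, 0), num_days_back=2).items():
        print(day, status)
```

Because each date is processed independently, the order in which the workers finish does not affect the result.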
f
@wooden-sandwich-59360 I will make one more change to enforce a dependency between each LP (node), so you can get serial execution if that’s what you want
or if you just want parallel, then you can simply control this using max-parallelism
w
@broad-monitor-993 Your understanding looks correct. I wanted to point at the distinction between backfill and look-back processing. Backfill is “reprocessing”, while look-back is used to aggregate up a window (weekly, monthly, etc.). Look-back lets us know that an aggregate task is ready to run, because the previous week is complete. Hopefully the look-back is simply confirming previous successful runs throughout the week. Another detail here is that look-back most likely can use datetime.now() to figure out which dates it should check/process, while backfill needs to be able to take date as an input for more control over the window to be backfilled. @freezing-airport-6809 Serial processing between the “same” workflow is not important, parallel processing should do the trick. I’ll implement your suggestion sometime in the next few days and give feedback.
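The distinction drawn above between look-back and backfill can be illustrated with a small plain-Python sketch. All function names here are hypothetical, and `successful_runs` stands in for however previous run status is actually tracked. The look-back window is anchored to the current date, while the backfill window is chosen explicitly by the caller.

```python
from datetime import date, timedelta
from typing import List, Optional, Set


def lookback_dates(window_days: int, today: Optional[date] = None) -> List[str]:
    """Look-back: the window_days days before today, newest first."""
    today = today or date.today()  # defaults to the current date
    return [(today - timedelta(days=i)).isoformat() for i in range(1, window_days + 1)]


def window_complete(window: List[str], successful_runs: Set[str]) -> bool:
    """An aggregate is ready only if every date in the window succeeded."""
    return all(day in successful_runs for day in window)


def backfill_dates(start: date, end: date) -> List[str]:
    """Backfill: an explicit, inclusive date range, oldest first."""
    if end < start:
        raise ValueError("end must not be before start")
    return [(start + timedelta(days=i)).isoformat() for i in range((end - start).days + 1)]


if __name__ == "__main__":
    week = lookback_dates(7)
    print("weekly aggregate ready:", window_complete(week, successful_runs=set(week)))
    print("backfill:", backfill_dates(date(2021, 12, 30), date(2022, 1, 1)))
```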
❤️ 1
@freezing-airport-6809 Gave your suggestion a go and it works great. Thanks for the input!
💯 2
b
[flyte-docs] Hey @wooden-sandwich-59360 would you mind filling in a docs issue below 👇 so that we can write a user guide example for backfilling?
w
For sure, will have it done sometime tomorrow.
🙏 1
f
@wooden-sandwich-59360 should we make it a `pyflyte backfill --lp name --start-date …`?
👀 1
w
I think that makes sense! Docs issue is filed.
❤️ 1
f
Cc @broad-monitor-993