Good morning, I have been looking for a way to run...
# ask-the-community
f
Good morning, I have been looking for a way to run backfill jobs. Typically we run daily batch jobs, which work well with a Launch Plan. But being able to process old datasets, each identified by a date, is important (schemas change, things fail, etc.). I have been trying to generate lists of dates to process (something like `generate_dates(start_date, num_days_back) -> List[str]`). If I hardcode a for loop (`for i in range(10)`) and create dates that way with no particular inputs, the workflow executes correctly, but it would be really nice to be able to take start date and number of days back as an input to the workflow (the blocker is that we cannot iterate over a Promise). I see some discussion on backfilling on Slack, so I am probably just missing something obvious. Is this already baked into LaunchPlan? Would love to hear how others have solved this before 🙂 This is the last major hurdle before we are ready to put Flyte into prod - it’s such an improvement over Luigi.
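For reference, the date-list helper mentioned above could look roughly like the sketch below. Only the signature comes from the message; the ISO `YYYY-MM-DD` string format and the choice to return the start date plus the `num_days_back` days before it are assumptions.

```python
from datetime import date, timedelta
from typing import List


def generate_dates(start_date: str, num_days_back: int) -> List[str]:
    """Return start_date and the num_days_back days before it, newest first.

    Assumption: dates are ISO strings such as "2022-03-02".
    """
    start = date.fromisoformat(start_date)
    return [
        (start - timedelta(days=offset)).isoformat()
        for offset in range(num_days_back + 1)
    ]
```

Called as plain Python, `generate_dates("2022-03-02", 3)` walks back across the month boundary to `2022-02-27`. Inside a regular Flyte workflow the same call would receive Promises rather than values, which is the blocker described in the message.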
s
@Fredrik Lyford, good to know that you found Flyte helpful. We don’t have a backfill feature yet. It is on our roadmap and will be implemented soon. cc @Ketan (kumare3)
k
We were just talking about it
There is a very cheap way to do backfill, cc @Yee
@Fredrik Lyford would you be open to trying a simple way to backfill and then we can do a more systematic integration
Cc @Pradithya Aria Pura - we will share a quick trick - and would love your feedback
s
> but it would be really nice to be able to take start date and number of days back as an input to the workflow (the blocker is that we cannot iterate over a Promise).
You can use a dynamic workflow to perform iterations over workflow inputs. (thanks @Dan Rammer (hamersaw))
k
@Samhita Alla a better option would be to use an imperative workflow to construct a static backfill workflow that can be generated and tracked
f
@Ketan (kumare3) We can give it a go, what did you have in mind?
k
Sure. It’s a day off today in the US; we will send a prototype
@Yee / @Niels Bantilan / @Pradithya Aria Pura please TAL
cc @honnix
n
Hi @Fredrik Lyford just wanted to clarify something about this:
> but it would be really nice to be able to take start date and number of days back as an input to the workflow
If I understand correctly, is this what you want to achieve at a high level?
Assumptions:
• I have a `process_dataset` routine that processes a single dataset for a particular date
• Every day I process today’s dataset, in addition to datasets for some number of days back.
• Each invocation of `process_dataset` for a particular day’s dataset is completely independent from processing other day’s datasets.
Every day, kick off scheduled job at `kickoff_datetime`:
1. Pass in `kickoff_datetime` into main entrypoint, in addition to a `num_days_back` parameter
2. Generate a list of dates, which includes today’s date and some `num_days_back` worth of days
3. Apply `process_dataset` to each dataset in an embarrassingly parallel fashion
Does this seem correct?
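The three steps above can be sketched in plain Python, outside Flyte, to make the intended data flow concrete. This is an illustration only: the body of `process_dataset`, the entrypoint name `run`, and the `max_workers` parameter are assumptions, and a thread pool stands in for whatever executes the independent jobs.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime, timedelta
from typing import List


def process_dataset(day: str) -> str:
    # Placeholder: each date is processed independently of the others.
    return f"processed {day}"


def run(kickoff_datetime: datetime, num_days_back: int, max_workers: int = 4) -> List[str]:
    # Step 2: today's date plus num_days_back earlier days.
    today = kickoff_datetime.date()
    days = [(today - timedelta(days=i)).isoformat() for i in range(num_days_back + 1)]
    # Step 3: apply process_dataset to every date in parallel.
    # pool.map returns results in the same order as the input dates.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(process_dataset, days))
```

Because no date depends on another, the order in which the jobs finish does not matter; only the cap on concurrent workers affects how the work is scheduled.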
k
@Fredrik Lyford I will make one more change: enforcing a dependency between “each lp” / node, so that you can get serial execution if that’s what you want
or if you just want parallel, then you can simply control this using max-parallelism
f
@Niels Bantilan Your understanding looks correct. I wanted to point out the distinction between backfill and look-back processing. Backfill is “reprocessing”, while look-back is used to aggregate over a window (weekly, monthly, etc.). Look-back lets us know that an aggregate task is ready to run, because the previous week is complete. Hopefully the look-back is simply confirming previous successful runs throughout the week. Another detail here is that look-back can most likely use datetime.now() to figure out which dates it should check/process, while backfill needs to be able to take a date as an input for more control over the window to be backfilled. @Ketan (kumare3) Serial processing between runs of the “same” workflow is not important; parallel processing should do the trick. I’ll implement your suggestion sometime in the next few days and give feedback.
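To illustrate the look-back side of that distinction, here is a small plain-Python sketch of a readiness check. The function name `missing_days`, the 7-day default window, and the idea of passing in a set of already-completed dates are all assumptions made for the example.

```python
from datetime import date, timedelta
from typing import List, Set


def missing_days(window_end: date, completed: Set[date], window_days: int = 7) -> List[date]:
    """Return the dates in the look-back window that have no successful run.

    An empty result means the aggregate for the window is ready to run;
    any dates returned are candidates for a backfill.
    """
    window = [window_end - timedelta(days=i) for i in range(window_days)]
    return sorted(day for day in window if day not in completed)
```

A look-back job could call this with `date.today()` as `window_end`, while a backfill would take the dates to reprocess as explicit inputs.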
@Ketan (kumare3) Gave your suggestion a go and it works great. Thanks for the input!
n
[flyte-docs] Hey @Fredrik Lyford would you mind filling in a docs issue below 👇 so that we can write a user guide example for backfilling?
f
For sure, will have it done sometime tomorrow.
k
@Fredrik Lyford should we make it a ‘pyflyte backfill --lp name --start-date …’
f
I think that makes sense! Docs issue is filed.
k
Cc @Niels Bantilan