git.net

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Basic modeling question


Gabriel -

One approach I've seen for a similar use case is to have multiple related
rollups in one DAG that runs daily, then have the non-daily tasks skip most
of the time (e.g., weekly only actually executes on Sundays and is
parameterized to look at the last 7 days).

You could implement that not running part a few ways, but one idea is a
sensor in front of the weekly rollup task.  Imagine a SundaySensor like return
execution_date.weekday() == 6.  One thing to keep in mind here is
dependence on the DAG's cron schedule being more granular than the tasks.

I think this could generalize into a DayOfWeekSensor / DayOfMonthSensor
that would be nice to have.

Of course this does mean some scheduler inefficiency on the skip days, but
as long as those skips are fast and the overall number of tasks is small, I
can accept that.

*Taylor Edmiston*
Blog <https://blog.tedmiston.com/> | CV
<https://stackoverflow.com/cv/taylor> | LinkedIn
<https://www.linkedin.com/in/tedmiston/> | AngelList
<https://angel.co/taylor> | Stack Overflow
<https://stackoverflow.com/users/149428/taylor-edmiston>


On Wed, Aug 8, 2018 at 1:11 PM, Gabriel Silk <gsilk@xxxxxxxxxxx.invalid>
wrote:

> Hello Airflow community,
>
> I have a basic question about how best to model a common data pipeline
> pattern here at Dropbox.
>
> At Dropbox, all of our logs are ingested and written into Hive in hourly
> and/or daily rollups. On top of this data we build many weekly and monthly
> rollups, which typically run on a daily cadence and compute results over a
> rolling window.
>
> If we have a metric X, it seems natural to put the daily, weekly, and
> monthly rollups for metric X all in the same DAG.
>
> However, the different rollups have different dependency structures. The
> daily job only depends on a single day partition, whereas the weekly job
> depends on 7, the monthly on 28.
>
> In Airflow, it seems the two paradigms for modeling dependencies are:
> 1) Depend on a *single run of a task* within the same DAG
> 2) Depend on *multiple runs of task* by using an ExternalTaskSensor
>
> I'm not sure how I could possibly model this scenario using approach #1,
> and I'm not sure approach #2 is the most elegant or performant way to model
> this scenario.
>
> Any thoughts or suggestions?
>


( ! ) Warning: include(msgfooter.php): failed to open stream: No such file or directory in /var/www/git/apache-airflow-development/msg04172.html on line 125
Call Stack
#TimeMemoryFunctionLocation
10.0014368744{main}( ).../msg04172.html:0

( ! ) Warning: include(): Failed opening 'msgfooter.php' for inclusion (include_path='.:/var/www/git') in /var/www/git/apache-airflow-development/msg04172.html on line 125
Call Stack
#TimeMemoryFunctionLocation
10.0014368744{main}( ).../msg04172.html:0