git.net

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Basic modeling question


To expand on Taylor's idea

I recently wrote a ScheduleBlackoutSensor that would allow you to prevent a
task from running if it meets the criteria provided. It accepts an array of
args for any number of the criteria so you could leverage this sensor to
provide "blackout" runs for a range of days of the week.

https://github.com/apache/incubator-airflow/pull/3702/files

For example,

task = ScheduleBlackoutSensor(day_of_week=[0,1,2,3,4,5], dag=dag)

Would prevent a task from running Monday - Saturday, allowing it to run on
Sunday.

You could leverage this Sensor as you would any other sensor or you could
invert the logic so that you would only need to specify

task = ScheduleBlackoutSensor(day_of_week=6, dag=dag)

To "whitelist" a task to run on Sundays.


Let me know if you have any questions

On Wed, Aug 8, 2018 at 1:47 PM Taylor Edmiston <tedmiston@xxxxxxxxx> wrote:

> Gabriel -
>
> One approach I've seen for a similar use case is to have multiple related
> rollups in one DAG that runs daily, then have the non-daily tasks skip most
> of the time (e.g., weekly only actually executes on Sundays and is
> parameterized to look at the last 7 days).
>
> You could implement that not running part a few ways, but one idea is a
> sensor in front of the weekly rollup task.  Imagine a SundaySensor like
> return
> execution_date.weekday() == 6.  One thing to keep in mind here is
> dependence on the DAG's cron schedule being more granular than the tasks.
>
> I think this could generalize into a DayOfWeekSensor / DayOfMonthSensor
> that would be nice to have.
>
> Of course this does mean some scheduler inefficiency on the skip days, but
> as long as those skips are fast and the overall number of tasks is small, I
> can accept that.
>
> *Taylor Edmiston*
> Blog <https://blog.tedmiston.com/> | CV
> <https://stackoverflow.com/cv/taylor> | LinkedIn
> <https://www.linkedin.com/in/tedmiston/> | AngelList
> <https://angel.co/taylor> | Stack Overflow
> <https://stackoverflow.com/users/149428/taylor-edmiston>
>
>
> On Wed, Aug 8, 2018 at 1:11 PM, Gabriel Silk <gsilk@xxxxxxxxxxx.invalid>
> wrote:
>
> > Hello Airflow community,
> >
> > I have a basic question about how best to model a common data pipeline
> > pattern here at Dropbox.
> >
> > At Dropbox, all of our logs are ingested and written into Hive in hourly
> > and/or daily rollups. On top of this data we build many weekly and
> monthly
> > rollups, which typically run on a daily cadence and compute results over
> a
> > rolling window.
> >
> > If we have a metric X, it seems natural to put the daily, weekly, and
> > monthly rollups for metric X all in the same DAG.
> >
> > However, the different rollups have different dependency structures. The
> > daily job only depends on a single day partition, whereas the weekly job
> > depends on 7, the monthly on 28.
> >
> > In Airflow, it seems the two paradigms for modeling dependencies are:
> > 1) Depend on a *single run of a task* within the same DAG
> > 2) Depend on *multiple runs of task* by using an ExternalTaskSensor
> >
> > I'm not sure how I could possibly model this scenario using approach #1,
> > and I'm not sure approach #2 is the most elegant or performant way to
> model
> > this scenario.
> >
> > Any thoughts or suggestions?
> >
>


( ! ) Warning: include(msgfooter.php): failed to open stream: No such file or directory in /var/www/git/apache-airflow-development/msg04173.html on line 158
Call Stack
#TimeMemoryFunctionLocation
10.0020368752{main}( ).../msg04173.html:0

( ! ) Warning: include(): Failed opening 'msgfooter.php' for inclusion (include_path='.:/var/www/git') in /var/www/git/apache-airflow-development/msg04173.html on line 158
Call Stack
#TimeMemoryFunctionLocation
10.0020368752{main}( ).../msg04173.html:0