Re: Dealing with data latency


Hi James,
I've noticed that some DAGs fail if the services are restarted while a
sensor is waiting. Originally I didn't think retries would be relevant for
a time sensor, but it sounds like if the worker crashes, the only way for
the sensor to rerun is if its retry limit hasn't been exhausted. Is this
one of the points you are making?
Thanks.

On Tue, Jun 5, 2018 at 9:41 AM James Meickle <jmeickle@xxxxxxxxxxxxxx>
wrote:

> We have to use a lot of time sensors like this, for reports that shouldn't
> be filed to a third party before a certain time of day. Since these sensors
> are themselves tasks, they can fail to be scheduled, or can fail outright,
> for example if the underlying worker instance dies. I would recommend
> double-checking your concurrency settings (esp. since you will have
> multiple days' worth of DAGs running concurrently) and your retry settings.
>
> On Tue, Jun 5, 2018 at 10:34 AM, Pedro Machado <pedro@xxxxxxxxxxxxxx>
> wrote:
>
> > Thanks, Max!
> >
> > On Mon, Jun 4, 2018 at 12:47 PM Maxime Beauchemin <
> > maximebeauchemin@xxxxxxxxx> wrote:
> >
> > > The common standard is to have the execution_date aligned with the
> > > partition date in the database (say 2018-08-08) and contain data from
> > > 2018-08-08T00:00:00.000 to 2018-08-08T23:59:59.999.
> > >
> > > The partition date and execution_date match and correspond to the left
> > > bound of the time interval processed.
> > >
> > > Then you'd use some sensors to make sure this cannot run until the
> > > desired time or conditions are met.
> > >
> > > Max
> > >
> > > On Mon, Jun 4, 2018 at 5:46 AM Pedro Machado <pedro@xxxxxxxxxxxxxx>
> > > wrote:
> > >
> > > > Hi. What is the recommended way to deal with data latency? For
> > > > example, I have a feed that is not considered final until 72 hours
> > > > have passed after the end of the daily period.
> > > >
> > > > For example, Monday's data would be ready by Thursday at 23:59.
> > > >
> > > > Should I pull data based on the execution date minus a 72 hour offset
> > > > or use the execution date and somehow delay the data pull for 72
> > > > hours?
> > > >
> > > > The latter would be more intuitive (data pull date = execution date)
> > > > but I am not sure if it's a good pattern.
> > > >
> > > > Thanks,
> > > >
> > > > Pedro
> > > >
> > >
> >
>
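
As a rough illustration of the approach Max describes for Pedro's 72-hour
feed (execution_date as the left bound of the day processed, with a sensor
holding the pull until the data is final), the date arithmetic can be
sketched in plain Python. The names data_window, earliest_pull_time,
is_ready and DATA_LATENCY are hypothetical and are not part of Airflow. In a
real DAG a check like is_ready would presumably sit behind a sensor, and per
James's note that sensor would need retries configured so it can rerun after
a worker crash.

```python
from datetime import datetime, timedelta

# Assumption taken from the thread: the feed is considered final
# 72 hours after the end of the daily period.
DATA_LATENCY = timedelta(hours=72)


def data_window(execution_date):
    """Return the half-open interval [start, end) covered by a daily run.

    The start (left bound) matches the execution/partition date.
    """
    start = datetime(execution_date.year, execution_date.month, execution_date.day)
    return start, start + timedelta(days=1)


def earliest_pull_time(execution_date, latency=DATA_LATENCY):
    """Return the first moment at which the day's data is considered final."""
    _, end = data_window(execution_date)
    return end + latency


def is_ready(execution_date, now, latency=DATA_LATENCY):
    """Hypothetical sensor condition: True once the latency has elapsed."""
    return now >= earliest_pull_time(execution_date, latency)
```

For a run with execution date Monday 2018-06-04, earliest_pull_time gives
2018-06-08 00:00, the moment right after Thursday 23:59, so the data pull
date still equals the execution date and only the start of the pull is
delayed.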

