Re: Capturing data changes that happen after the initial data pull


Hi James. Sounds good! I'll give this a try.

I have a follow-up question related to this DAG ...

The DAG currently downloads data for 14 countries. On several
occasions, I've gotten requests to pull data for more countries. Before
Airflow, I handled these manually by running a script that pulls data
for a specific country going back to the start date.

Is there a more elegant way to handle this in Airflow than kicking off an
ad hoc DAG for the additional countries?

Also, I am thinking about using Redshift Spectrum against the resulting
files, so having each country in a separate file would be desirable for
partitioning.

Is it a good pattern to create parallel tasks for each country? This would
significantly increase the number of tasks in the DAG.

I am currently generating a single set of tasks that pulls data for all the
required countries.
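
For reference, the per-country version I'm considering would look roughly
like this sketch (the callable, DAG id, and country list are placeholders):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator

    COUNTRIES = ["us", "gb", "de"]  # ...the full list of countries

    def pull_country_data(country):
        # placeholder: pull one country's data for the run's date range
        pass

    dag = DAG(dag_id="country_data_pull",
              start_date=datetime(2018, 1, 1),
              schedule_interval="@daily")

    for country in COUNTRIES:
        # one independent, parallelizable task per country
        PythonOperator(task_id="pull_%s" % country,
                       python_callable=pull_country_data,
                       op_kwargs={"country": country},
                       dag=dag)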

Thanks!

On Thu, Jun 7, 2018, 4:42 AM James Meickle <jmeickle@xxxxxxxxxxxxxx> wrote:

> One way to do this would be to have your DAG file return two nearly
> identical DAGs (e.g., put it in a factory function and call it twice). The
> difference would be that the "final" run would have a conditional to add an
> extra time sensor at the DAG root, to wait N days for the data to finalize.
> The effect would be that two DAG runs would start each day, but the latter
> would stay uncompleted for a while.
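>
> A rough sketch of that shape (the DAG ids, the 7-day wait, and the sensor
> wiring are just illustrative):
>
>     from datetime import datetime, timedelta
>
>     from airflow import DAG
>     from airflow.operators.dummy_operator import DummyOperator
>     from airflow.operators.sensors import TimeDeltaSensor
>
>     def build_dag(dag_id, wait_days=0):
>         dag = DAG(dag_id, start_date=datetime(2018, 1, 1),
>                   schedule_interval="@daily")
>         root = DummyOperator(task_id="start", dag=dag)
>         if wait_days:
>             # the "final" variant gates everything on a time sensor
>             # that waits N extra days for the data to finalize
>             sensor = TimeDeltaSensor(task_id="wait_for_final_data",
>                                      delta=timedelta(days=wait_days),
>                                      dag=dag)
>             sensor >> root
>         # ...attach the actual pull tasks downstream of root...
>         return dag
>
>     initial_dag = build_dag("data_pull_initial")
>     final_dag = build_dag("data_pull_final", wait_days=7)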
>
> Doing it that way has the advantage of still aligning with daily updates
> rather than having a daily DAG and a weekly DAG; keeping execution dates
> aligned with the data being processed; and not having the primary DAG
> runs stay open for days (better for status tracking).
>
> It has the disadvantage that any extremely long-running tasks (including
> sensors) will be more likely to fail, and that you would have several more
> tasks and DAG runs open, consuming resources and concurrency slots.
>
> On Jun 6, 2018 3:34 PM, "Pedro Machado" <pedro@xxxxxxxxxxxxxx> wrote:
>
> This is a similar case. My idea was to rerun the whole data pull. The
> current DAG is idempotent, so there is no issue with inserting duplicates.
> Now, I'm trying to figure out the best way to code it in Airflow. Thanks
>
>
> On Wed, Jun 6, 2018 at 2:29 PM Ben Gregory <ben@xxxxxxxxxxxxx> wrote:
>
> > I've seen a similar use case with DoubleClick/Google Analytics (
> > https://support.google.com/ds/answer/2791195?hl=en), where the reporting
> > metrics have a "lookback window" of up to 30 days for conversion
> > attribution (so if a user converts 14 days after clicking on an ad, it
> > will still be counted).
> >
> > What we ended up doing in that case is setting the start/stop params of
> > the query to the full window (pulled daily) and then upserting in
> > Redshift based on a primary key (in our case actually a composite key
> > with multiple attributes). So you end up pulling a lot of redundant data,
> > but since there's no way to pull only updated records (which sounds like
> > the case you're in), it's the best way to ensure your reporting is
> > up-to-date.
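> >
> > Roughly, the upsert was the standard Redshift staging-table pattern; a
> > sketch (table and key names changed, and it assumes a staging table
> > already loaded by an upstream task in an existing dag):
> >
> >     from airflow.operators.postgres_operator import PostgresOperator
> >
> >     UPSERT_SQL = """
> >     BEGIN;
> >     DELETE FROM reporting.conversions
> >     USING staging.conversions s
> >     WHERE reporting.conversions.ad_id = s.ad_id
> >       AND reporting.conversions.event_date = s.event_date;
> >     INSERT INTO reporting.conversions SELECT * FROM staging.conversions;
> >     COMMIT;
> >     """
> >
> >     upsert = PostgresOperator(task_id="upsert_conversions",
> >                               postgres_conn_id="redshift",
> >                               sql=UPSERT_SQL,
> >                               dag=dag)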
> >
> >
> >
> > On Wed, Jun 6, 2018 at 1:10 PM Pedro Machado <pedro@xxxxxxxxxxxxxx> wrote:
> >
> > > Yes. It's literally the same API calls with the same dates, only done
> > > a few days later. It's just redoing the same data pull, but instead of
> > > pulling one date each DAG run, it would pull all dates for the previous
> > > week on Tuesdays.
> > >
> > > Thanks!
> > >
> >
> >
> > --
> >
> > *Ben Gregory*
> > Data Engineer, Astronomer <https://www.astronomer.io/>
> >
> > Mobile: +1-615-483-3653 • Online: astronomer.io
> >
> > Download our new ebook, From Volume to Value - A Guide to Data
> > Engineering. <http://marketing.astronomer.io/guide/>
> >
>