Re: execution_date - can we stop the confusion?


I think if you have a functional mindset (as in "functional data engineering
<https://medium.com/@maximebeauchemin/functional-data-engineering-a-modern-paradigm-for-batch-data-processing-2327ec32c42a>")
as opposed to a cron mindset, using the left bound of the time interval
makes a lot of sense. Things like your daily table partition keys align
with your Airflow execution_date.
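
To illustrate the left-bound convention described above, here is a minimal
sketch in plain Python. It does not use the Airflow API: the helper names
(interval_for_run, partition_key) and the daily interval are assumptions made
for illustration only.

```python
from datetime import datetime, timedelta


def interval_for_run(execution_date, schedule_interval):
    """Return (left_bound, right_bound) of the period a run covers.

    Hypothetical helper: execution_date is treated as the LEFT bound.
    """
    return execution_date, execution_date + schedule_interval


def partition_key(execution_date):
    """Hypothetical daily partition key derived from the left bound."""
    return execution_date.strftime("%Y-%m-%d")


execution_date = datetime(2018, 9, 25)
daily = timedelta(days=1)

left, right = interval_for_run(execution_date, daily)

# The run can only start once the period has ended, i.e. at the right bound.
earliest_run = right

# The partition holding the data for 2018-09-25 matches execution_date.
key = partition_key(execution_date)
```

Under these assumptions, the run that processes the data for 2018-09-25 starts
no earlier than 2018-09-26, yet it writes to the partition keyed by its
execution_date.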

The main thing is that whatever we do, we cannot break backwards
compatibility. Offering both views (left bound/right bound), as has been
proposed before, either as an environment setting or as a per-user
preference, is even more confusing to me personally. Users would have to
switch context as they help each other or change environments.

Also note that your intuition may differ from other people's intuition, and
that "unlearning" something is way harder than learning something.

My personal take on this is to make this a rite of passage. This is just
one of the many things you have to learn when learning Airflow.

Max

On Wed, Sep 26, 2018 at 8:18 AM Sam Elamin <hussam.elamin@xxxxxxxxx> wrote:

> Hi Bolke
>
> Speaking as a consultant who is constantly training other teams how to use
> airflow, I do frequently see this confusion.
> Another one is how the batch_date is always batch_date + interval or as the
> docs make it quite clear
>
> "*Let’s Repeat That* The scheduler runs your job one schedule_interval
> AFTER
> the start date, at the END of the period."
>
> Renaming it would make it simpler for newbies, but essentially they will
> need to understand how Airflow behaves: execution_date is the batch
> execution date, not the run date of the DAG.
>
> I am actually in the process of writing a blog post
> <https://samelamin.github.io/2017/04/27/Building-A-Datapipeline-part1/>
> about this, for which I could use people's feedback.
>
> If it helps, I find that explaining how backfills work and why they are
> important will drive home what the execution_date is :)
>
>
> Regards
> Sam
>
>
>
> On Wed, Sep 26, 2018 at 4:10 PM Bolke de Bruin <bdbruin@xxxxxxxxx> wrote:
>
> > I don't think this makes sense, and I don't think that anyone had a real
> > issue with this. Execution date has been clearly documented and is part of
> > the core principles of Airflow. Renaming will create more confusion.
> >
> > Please note that I do think that as an anonymous user you cannot speak for
> > any "new airflow user". That is a contradiction to me.
> >
> > Thanks
> > Bolke
> >
> > Sent from my iPhone
> >
> > > On 26 Sep 2018, at 07:59, airflowuser <airflowuser@xxxxxxxxxxxxxx.INVALID>
> > > wrote:
> > >
> > > One of the most annoying and hard-to-understand things in Airflow, and
> > > one that goes against all common sense, is the execution_date behavior.
> > > I assume that any new Airflow user has struggled with it.
> > > The number of questions with answers referring to
> > > https://airflow.apache.org/scheduler.html?scheduling-triggers is
> > > uncountable.
> > >
> > > Most people mistakenly think that execution_date is the datetime at
> > > which the DAG started to run.
> > >
> > > I suggest the following changes:
> > > 1. Rename execution_date to something else, like run_stamped.
> > > This name won't cause people to get confused.
> > > 2. Add a new variable which indicates the actual datetime when the
> > > DAG run was generated. Call it execution_start_date. People seem to want
> > > to know when the DAG actually started to be executed/run.
> > >
> > > These are only naming changes. There is no need to actually change the
> > > behavior. This will only make things simpler, because a user who
> > > encounters run_stamped won't be confused by the name, as happens with
> > > execution_date.
> >
>