Re: execution_date - can we stop the confusion?


I would like to challenge the notion that "execution_date" is well
documented. Looking at airflow.apache.org right now and searching for all
references to "execution_date", I find that the only definition of
execution_date is, "The execution date of the DAG". There are some other
passing references that imply more but nothing explicit.

From the documentation, as currently published, it seems reasonable to
expect some concurrence between "execution_date" and when a dag executes,
especially given the heavy repetition of, "execution_date - The execution
date of the DAG".

Personally, I think the problem is the word "execution", not with which
bound is used to label/define an interval. I think this is especially
difficult for people coming to Airflow with a cron background who are not
necessarily thinking about intervals.
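
To make the mismatch concrete for anyone coming from cron, here is a minimal
sketch in plain Python. It is an illustration only, not Airflow code: the
function name describe_run and its dictionary keys are made up for this
example, and it assumes a fixed timedelta schedule. It relies only on the
behavior described elsewhere in this thread: execution_date labels the left
bound of the interval, and the run happens at the end of that interval.

```python
from datetime import datetime, timedelta


def describe_run(execution_date, schedule_interval):
    """Hypothetical helper: relate an interval label to when its run can start."""
    interval_end = execution_date + schedule_interval
    return {
        "execution_date": execution_date,  # left bound, used as the label
        "interval_end": interval_end,      # right bound of the same interval
        "earliest_start": interval_end,    # the run cannot start before this
    }


run = describe_run(datetime(2018, 9, 26), timedelta(days=1))
print(run["execution_date"].isoformat())  # 2018-09-26T00:00:00
print(run["earliest_start"].isoformat())  # 2018-09-27T00:00:00
```

Someone thinking in cron terms would expect the two printed values to match;
in the interval model they differ by one schedule interval.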

On Thu, Sep 27, 2018 at 11:23 AM Brian Greene <
brian@xxxxxxxxxxxxxxxxxxxxxxxxx> wrote:

> Second use of “inane” on this subject.  Brilliant, less combative response
> Chris.
>
> There’s another point: left bound makes sense to some people, right bound
> to others.
>
> There’s no way to know or measure how “hard” this is to new users, so even
> if the change was made - new name, use right bound... how can you be sure
> you’re not actually confusing a LARGER number of new users from that point
> on?
>
> It’s like left handed versus right handed people, except there’s no
> statistical basis for your argument that one group is larger than the
> other, or that there would actually be a measurable uptick in understanding
> and usability across the ENTIRE user community.
>
> So your proposal 100% breaks backwards compatibility of code AND concept,
> on anecdotal evidence that it would somehow make usage magically easier?
>
> Airflow is like a bulldozer made out of scalpels that can fly (not well,
> but it’s possible).  A slick dag can accomplish a staggering amount of work
> with the smallest little bit of elegant code.  Learning to “think in
> airflow” though is so, so much more than understanding execution date.
> That’s barely table stakes in terms of concepts you’ll need to accept to be
> effective with airflow.
>
> Maybe somebody just has a thing against lefties?  Some kind of
> left-bound-thinking conspiracy?
>
> Sent from a device with less than stellar autocorrect
>
> > On Sep 27, 2018, at 12:56 PM, Chris Palmer <chris@xxxxxxxxxxxx> wrote:
> >
> > While taking a step back makes some sense, we also need to identify what
> > the issue is. Simply saying 'execution_date behavior is confusing to new
> > users' isn't good enough. What is confusing about it? Is it what it
> > represents, or just the name itself?
> >
> > There are a number of different timestamps that might be of interest,
> > including (but not limited to):
> >
> > *Identifying timestamp*
> > For any time interval, there are two natural choices of timestamps to
> > represent that interval, the left and right bounds. For Airflow the left
> > bound has been chosen, and is called execution_date. For various reasons, I
> > think that makes a much better choice than the right bound.
> >
> > *Create/update/delete timestamps*
> > Timestamps representing when particular database records were created,
> > updated and/or deleted. I don't believe that Airflow currently records
> > these.
> >
> > *Runtime timestamps*
> > The timestamps that a task or other process started and stopped. Airflow
> > records these for Tasks, but I think the implementation is maybe a little
> > lacking for DagRuns.
> >
> >
> > So what's the confusion with execution_date? Is it what it represents or
> > the name itself?
> >
> > I think part of the learning curve with Airflow is understanding that
> > execution_date is the left bound of the interval. No matter what name you
> > use for the identifying timestamp I think new users will need to learn what
> > that choice means. Changing the name won't magically make all the confusion
> > go away.
> >
> > While I don't think execution_date is the greatest name in the world, it's
> > a lot better than the suggested alternative run_stamped. Tasks also have an
> > identifying timestamp, and if I saw run_stamped on a Task I would have no
> > idea what it means (stamped by what?).
> >
> > While there may be better names than execution_date, I don't think they are
> > so much better that it is worth the effort to overhaul such an integral
> > part of Airflow. Maybe some improvements to the documentation could be
> > made, but nothing so drastic as renaming such a core item.
> >
> >
> > As for the second suggestion to add "a new variable which indicated the
> > actual datetime when the DAG run was generated. call it
> > execution_start_date". It is very unclear what the desired outcome is with
> > this.
> >
> > To me "generated" implies creation time, i.e. recorded in the database.
> > However, creation of a DagRun record in the database is a distinct event
> > from when Tasks associated with that DagRun start executing. Plus DagRuns
> > themselves don't actually "run" - Tasks are the only thing that really gets
> > run by Airflow.
> >
> > What is actually desired here?
> > - The right bound of the schedule interval?
> > - The time the DagRun was created?
> > - The time that any Tasks associated with a DagRun were first considered
> > by the scheduler?
> > - The time that any Tasks associated with a DagRun were first scheduled?
> > - The time that any Tasks associated with a DagRun were actually started
> > by a worker?
> >
> >
> > The lack of clarity and completeness around these suggestions, alongside
> > inane declarations like "This name won't cause people to get confused" is
> > hardly a good way to get people to take suggestions seriously.
> >
> > Chris
> >
> >
> > On Wed, Sep 26, 2018 at 7:37 PM George Leslie-Waksman <waksman@xxxxxxxxx>
> > wrote:
> >
> >> This comes up a lot. I've seen it on this mailing list multiple times and
> >> it's something that I have to explicitly call out to every single person
> >> that I've helped train up on Airflow.
> >>
> >> If we take a moment to set aside why things are the way they are, what the
> >> documentation says, and how experienced users feel things should behave,
> >> there still remains the fact that a lot of new users get confused by how
> >> "execution_date" works.
> >>
> >> Whether it's a problem, whether we need to do something, and what we could
> >> do are all separate questions but I think it's important that we
> >> acknowledge and start from:
> >>
> >> A lot of new users get confused by how "execution_date" works.
> >>
> >> I recognize that some of this is a learning curve issue and some of this is
> >> a mindset issue but it raises the question: do enough users benefit from the
> >> current structure to justify the harm to new users?
> >>
> >> --George
> >>
> >> On Wed, Sep 26, 2018 at 1:40 PM Brian Greene <
> >> brian@xxxxxxxxxxxxxxxxxxxxxxxxx> wrote:
> >>
> >>> It took a minute to grok, but in the larger context of how af works it
> >>> makes perfect sense the way it is.  Changing something so fundamentally
> >>> breaking to every dag in existence should bring a comparable benefit.
> >>> Beyond avoiding teaching a concept you disagree with, what benefits
> >>> does the proposal bring to offset the cost of change?
> >>>
> >>> I’m gonna make a meme - “do you even airflow bro?”
> >>>
> >>> Sent from a device with less than stellar autocorrect
> >>>
> >>>> On Sep 26, 2018, at 8:33 AM, Maxime Beauchemin
> >>>> <maximebeauchemin@xxxxxxxxx> wrote:
> >>>>
> >>>> I think if you have a functional mindset (as in "functional data engineering
> >>>> <https://medium.com/@maximebeauchemin/functional-data-engineering-a-modern-paradigm-for-batch-data-processing-2327ec32c42a>")
> >>>> as opposed to a cron mindset, using the left bound of the time interval
> >>>> makes a lot of sense. Things like your daily table partition keys align
> >>>> with your Airflow execution_date.
> >>>>
> >>>> The main thing is that whatever we do we cannot break backwards
> >>>> compatibility. Offering both views (left bound/right bound), as it's been
> >>>> proposed before, either as an environment setting or a user personal
> >>>> preference is even more confusing to me personally. Users would have to
> >>>> switch context as they help each other or change environments.
> >>>>
> >>>> Also note that your intuition may differ from other people's intuition, and
> >>>> that "unlearning" something is way harder than learning something.
> >>>>
> >>>> My personal take on this is to make this a rite of passage. This is just
> >>>> one of the many things you have to learn when learning Airflow.
> >>>>
> >>>> Max
> >>>>
> >>>>> On Wed, Sep 26, 2018 at 8:18 AM Sam Elamin <hussam.elamin@xxxxxxxxx>
> >>>>> wrote:
> >>>>>
> >>>>> Hi Bolke
> >>>>>
> >>>>> Speaking as a consultant who is constantly training other teams how to use
> >>>>> airflow, I do frequently see this confusion.
> >>>>> Another one is how the batch_date is always batch_date + interval or as the
> >>>>> docs make it quite clear
> >>>>>
> >>>>> "*Let’s Repeat That* The scheduler runs your job one schedule_interval
> >>>>> AFTER the start date, at the END of the period."
> >>>>>
> >>>>> Renaming it would make it simpler for newbies, but essentially they will
> >>>>> need to understand how Airflow behaves, execution_date being the batch
> >>>>> execution date not the run_date of the DAG
> >>>>>
> >>>>> I am actually in the process of writing a blog post
> >>>>> <https://samelamin.github.io/2017/04/27/Building-A-Datapipeline-part1/>
> >>>>> about this, on which I could use people's feedback
> >>>>>
> >>>>> If it helps, I find that explaining how backfills work and why they are
> >>>>> important will drive home what the execution_date is :)
> >>>>>
> >>>>>
> >>>>> Regards
> >>>>> Sam
> >>>>>
> >>>>>
> >>>>>
> >>>>>> On Wed, Sep 26, 2018 at 4:10 PM Bolke de Bruin <bdbruin@xxxxxxxxx>
> >>>>>> wrote:
> >>>>>>
> >>>>>> I don't think this makes sense and I don't think that anyone had a real
> >>>>>> issue with this. Execution date has been clearly documented and is part of
> >>>>>> the core principles of airflow. Renaming will create more confusion.
> >>>>>>
> >>>>>> Please note that I do think that as an anonymous user you cannot speak for
> >>>>>> any "new airflow user". That is a contradiction to me.
> >>>>>>
> >>>>>> Thanks
> >>>>>> Bolke
> >>>>>>
> >>>>>> Sent from my iPhone
> >>>>>>
> >>>>>>> On 26 Sep 2018, at 07:59, airflowuser <airflowuser@xxxxxxxxxxxxxx.INVALID>
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>> The execution_date behavior is one of the most annoying and hard to
> >>>>>>> understand parts of Airflow, and it goes against all common sense. I assume
> >>>>>>> that any new Airflow user has been struggling with it.
> >>>>>>> The number of questions with answers referring to
> >>>>>>> https://airflow.apache.org/scheduler.html?scheduling-triggers is
> >>>>>>> uncountable.
> >>>>>>>
> >>>>>>> Most people mistakenly think that execution_date is the datetime at which
> >>>>>>> the DAG started to run.
> >>>>>>>
> >>>>>>> I suggest the following changes:
> >>>>>>> 1. Renaming the execution_date to something else like: run_stamped
> >>>>>>> This name won't cause people to get confused.
> >>>>>>> 2. Adding a new variable which indicated the actual datetime when the
> >>>>>>> DAG run was generated. call it execution_start_date. People seem to want
> >>>>>>> the information when the DAG actually started to be executed/run.
> >>>>>>>
> >>>>>>> These are only naming changes. There is no need to actually change the
> >>>>>>> behavior. This will only make things simpler: when a user encounters
> >>>>>>> run_stamped, he won't be confused by the name the way he is by
> >>>>>>> execution_date.