Re: execution_date - can we stop the confusion?


What about aliasing execution_date to period_start, and next_execution_date to period_end? Do we think this would help at all?
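
A minimal sketch of that aliasing idea, in plain Python rather than against Airflow's internals. The `Period` class and the `period_for` helper are hypothetical names made up for illustration. The only behaviour assumed from this thread is that execution_date is the left bound of the schedule interval and that the run is triggered once the interval has ended.

```python
from datetime import datetime, timedelta
from typing import NamedTuple


class Period(NamedTuple):
    """Hypothetical container for the two proposed aliases."""
    period_start: datetime  # alias for execution_date (left bound)
    period_end: datetime    # alias for next_execution_date (right bound)


def period_for(execution_date: datetime, schedule_interval: timedelta) -> Period:
    """Derive both bounds from execution_date for a fixed-length interval."""
    return Period(execution_date, execution_date + schedule_interval)


# A daily run labelled 2018-09-26 covers the whole of that day and is
# only triggered once the period has ended, i.e. at period_end or later.
p = period_for(datetime(2018, 9, 26), timedelta(days=1))
print(p.period_start.isoformat())
print(p.period_end.isoformat())
```

With names like these, the cron-minded reading ("when did it run?") lines up roughly with period_end, while partition keys and backfills keep using period_start.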

(Though things like ds and ts might still be confusing? This is probably where the OP got the idea for run_stamped from? One step at a time.)
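
To make the ds/ts point concrete, here is a small sketch assuming the documented meaning of those template variables: ds is the execution date formatted as YYYY-MM-DD and ts is its ISO 8601 form. It uses a naive `datetime` for brevity; real Airflow values may be timezone-aware, so the exact ts string can differ.

```python
from datetime import datetime, timedelta

execution_date = datetime(2018, 9, 26)       # left bound of the interval
schedule_interval = timedelta(days=1)        # assumed daily schedule

ds = execution_date.strftime("%Y-%m-%d")     # date stamp seen in templates
ts = execution_date.isoformat()              # timestamp seen in templates

# Earliest wall-clock time at which this run can actually start.
earliest_start = execution_date + schedule_interval

print(ds)
print(ts)
print(earliest_start.isoformat())
```

So for a daily DAG, ds reads as "yesterday" relative to when the task runs, which is the same left-bound surprise under a different name.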

Ash

On 27 September 2018 20:42:07 BST, George Leslie-Waksman <waksman@xxxxxxxxx> wrote:
>I would like to challenge the notion that "execution_date" is well
>documented. Looking at airflow.apache.org right now and searching for
>all references to "execution_date", I find that the only definition of
>execution_date is, "The execution date of the DAG". There are some
>other passing references that imply more but nothing explicit.
>
>From the documentation, as currently published, it seems reasonable to
>expect some concurrence between "execution_date" and when a dag
>executes, especially given the heavy repetition of, "execution_date -
>The execution date of the DAG".
>
>Personally, I think the problem is the word "execution", not with which
>bound is used to label/define an interval. I think this is especially
>difficult for people coming to Airflow with a cron background who are
>not necessarily thinking about intervals.
>
>On Thu, Sep 27, 2018 at 11:23 AM Brian Greene <
>brian@xxxxxxxxxxxxxxxxxxxxxxxxx> wrote:
>
>> Second use of “inane” on this subject.  Brilliant, less combative
>> response, Chris.
>>
>> There’s another point.. left bound makes sense to some people, right
>> bound to others.
>>
>> There’s no way to know or measure how “hard” this is to new users, so
>> even if the change was made - new name, use right bound... how can you
>> be sure you’re not actually confusing a LARGER number of new users
>> from that point on?
>>
>> It’s like left handed versus right handed people, except there’s no
>> statistical basis for your argument that one group is larger than the
>> other, or that there would actually be a measurable uptick in
>> understanding and usability across the ENTIRE user community.
>>
>> So your proposal 100% breaks backwards compatibility of code AND
>> concept, on anecdotal evidence that it would somehow make usage
>> magically easier?
>>
>> Airflow is like a bulldozer made out of scalpels that can fly (not
>> well, but it’s possible).  A slick dag can accomplish a staggering
>> amount of work with the smallest little bit of elegant code.  Learning
>> to “think in airflow” though is so, so much more than understanding
>> execution date. That’s barely table stakes in terms of concepts you’ll
>> need to accept to be effective with airflow.
>>
>> Maybe somebody just has a thing against lefties?  Some kind of
>> left-bound-thinking conspiracy?
>>
>> Sent from a device with less than stellar autocorrect
>>
>> > On Sep 27, 2018, at 12:56 PM, Chris Palmer <chris@xxxxxxxxxxxx> wrote:
>> >
>> > While taking a step back makes some sense, we also need to identify
>> > what the issue is. Simply saying 'execution_date behavior is
>> > confusing to new users' isn't good enough. What is confusing about
>> > it? Is it what it represents, or just the name itself?
>> >
>> > There are a number of different timestamps that might be of
>> > interest, including (but not limited to):
>> >
>> > *Identifying timestamp*
>> > For any time interval, there are two natural choices of timestamps
>> > to represent that interval, the left and right bounds. For Airflow
>> > the left bound has been chosen, and is called execution_date. For
>> > various reasons, I think that makes a much better choice than the
>> > right bound.
>> >
>> > *Create/update/delete timestamps*
>> > Timestamps representing when particular database records were
>> > created, updated and/or deleted. I don't believe that Airflow
>> > currently records these.
>> >
>> > *Runtime timestamps*
>> > The timestamps that a task or other process started and stopped.
>> > Airflow records these for Tasks, but I think the implementation is
>> > maybe a little lacking for DagRuns.
>> >
>> >
>> > So what's the confusion with execution_date? Is it what it
>> > represents or the name itself?
>> >
>> > I think part of the learning curve with Airflow is understanding
>> > that execution_date is the left bound of the interval. No matter
>> > what name you use for the identifying timestamp I think new users
>> > will need to learn what that choice means. Changing the name won't
>> > magically make all the confusion go away.
>> >
>> > While I don't think execution_date is the greatest name in the
>> > world, it's a lot better than the suggested alternative run_stamped.
>> > Tasks also have an identifying timestamp, and if I saw run_stamped
>> > on a Task I would have no idea what it means (stamped by what?).
>> >
>> > While there may be better names than execution_date, I don't think
>> > they are so much better that it is worth the effort to overhaul such
>> > an integral part of Airflow. Maybe some improvements to the
>> > documentation could be made, but nothing so drastic as renaming
>> > such a core item.
>> >
>> >
>> > As for the second suggestion to add "a new variable which indicated
>> > the actual datetime when the DAG run was generated. call it
>> > execution_start_date". It is very unclear what the desired outcome
>> > is with this.
>> >
>> > To me "generated" implies creation time, i.e. recorded in the
>> > database. However, creation of a DagRun record in the database is a
>> > distinct event from when Tasks associated with that DagRun start
>> > executing. Plus DagRuns themselves don't actually "run" - Tasks are
>> > the only thing that really gets run by Airflow.
>> >
>> > What is actually desired here?
>> > - The right bound of the schedule interval?
>> > - The time the DagRun was created?
>> > - The time that any Tasks associated with a DagRun were first
>> >   considered by the scheduler?
>> > - The time that any Tasks associated with a DagRun were first
>> >   scheduled?
>> > - The time that any Tasks associated with a DagRun were actually
>> >   started by a worker?
>> >
>> >
>> > The lack of clarity and completeness around these suggestions,
>> > alongside inane declarations like "This name won't cause people to
>> > get confused" is hardly a good way to get people to take suggestions
>> > seriously.
>> >
>> > Chris
>> >
>> >
>> > On Wed, Sep 26, 2018 at 7:37 PM George Leslie-Waksman <waksman@xxxxxxxxx> wrote:
>> >
>> >> This comes up a lot. I've seen it on this mailing list multiple
>> >> times and it's something that I have to explicitly call out to
>> >> every single person that I've helped train up on Airflow.
>> >>
>> >> If we take a moment to set aside why things are the way they are,
>> >> what the documentation says, and how experienced users feel things
>> >> should behave; there still remains the fact that a lot of new users
>> >> get confused by how "execution_date" works.
>> >>
>> >> Whether it's a problem, whether we need to do something, and what
>> >> we could do are all separate questions but I think it's important
>> >> that we acknowledge and start from:
>> >>
>> >> A lot of new users get confused by how "execution_date" works.
>> >>
>> >> I recognize that some of this is a learning curve issue and some of
>> >> this is a mindset issue but it begs the question: do enough users
>> >> benefit from the current structure to justify the harm to new
>> >> users?
>> >>
>> >> --George
>> >>
>> >> On Wed, Sep 26, 2018 at 1:40 PM Brian Greene <
>> >> brian@xxxxxxxxxxxxxxxxxxxxxxxxx> wrote:
>> >>
>> >>> It took a minute to grok, but in the larger context of how af
>> >>> works it makes perfect sense the way it is.  Changing something so
>> >>> fundamentally breaking to every dag in existence should bring a
>> >>> comparable benefit. Beyond avoiding teaching a concept you
>> >>> disagree with, what benefits does the proposal bring to offset the
>> >>> cost of change?
>> >>>
>> >>> I’m gonna make a meme - “do you even airflow bro?”
>> >>>
>> >>> Sent from a device with less than stellar autocorrect
>> >>>
>> >>>> On Sep 26, 2018, at 8:33 AM, Maxime Beauchemin <
>> >>> maximebeauchemin@xxxxxxxxx> wrote:
>> >>>>
>> >>>> I think if you have a functional mindset (as in "functional data
>> >>>> engineering
>> >>>> <https://medium.com/@maximebeauchemin/functional-data-engineering-a-modern-paradigm-for-batch-data-processing-2327ec32c42a>")
>> >>>> as opposed to a cron mindset, using the left bound of the time
>> >>>> interval makes a lot of sense. Things like your daily table
>> >>>> partition keys align with your Airflow execution_date.
>> >>>>
>> >>>> The main thing is that whatever we do we cannot break backwards
>> >>>> compatibility. Offering both views (left bound/right bound), as
>> >>>> it's been proposed before, either as an environment setting or a
>> >>>> user personal preference is even more confusing to me personally.
>> >>>> Users would have to switch context as they help each other or
>> >>>> change environments.
>> >>>>
>> >>>> Also note that your intuition may differ from other people's
>> >>>> intuition, and that "unlearning" something is way harder than
>> >>>> learning something.
>> >>>>
>> >>>> My personal take on this is to make this a rite of passage. This
>> >>>> is just one of the many things you have to learn when learning
>> >>>> Airflow.
>> >>>>
>> >>>> Max
>> >>>>
>> >>>>> On Wed, Sep 26, 2018 at 8:18 AM Sam Elamin <hussam.elamin@xxxxxxxxx> wrote:
>> >>>>>
>> >>>>> Hi Bolke
>> >>>>>
>> >>>>> Speaking as a consultant who is constantly training other teams
>> >>>>> how to use airflow, I do frequently see this confusion.
>> >>>>> Another one is how the batch_date is always batch_date +
>> >>>>> interval or as the docs make it quite clear
>> >>>>>
>> >>>>> "*Let’s Repeat That* The scheduler runs your job one
>> >>>>> schedule_interval AFTER the start date, at the END of the period."
>> >>>>>
>> >>>>> Renaming it would make it simpler for newbies, but essentially
>> >>>>> they will need to understand how Airflow behaves, execution_date
>> >>>>> being the batch execution date not the run_date of the DAG
>> >>>>>
>> >>>>> I am actually in the process of writing a blog post
>> >>>>> <https://samelamin.github.io/2017/04/27/Building-A-Datapipeline-part1/>
>> >>>>> about this, for which I could use people's feedback
>> >>>>>
>> >>>>> If it helps, I find that explaining how backfills work and why
>> >>>>> they are important will drive home what the execution_date is :)
>> >>>>>
>> >>>>>
>> >>>>> Regards
>> >>>>> Sam
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>>> On Wed, Sep 26, 2018 at 4:10 PM Bolke de Bruin <bdbruin@xxxxxxxxx> wrote:
>> >>>>>>
>> >>>>>> I don't think this makes sense and I don't think that anyone had
>> >>>>>> a real issue with this. Execution date has been clearly
>> >>>>>> documented and is part of the core principles of airflow.
>> >>>>>> Renaming will create more confusion.
>> >>>>>>
>> >>>>>> Please note that I do think that as an anonymous user you cannot
>> >>>>>> speak for any "new airflow user". That is a contradiction to me.
>> >>>>>>
>> >>>>>> Thanks
>> >>>>>> Bolke
>> >>>>>>
>> >>>>>> Sent from my iPhone
>> >>>>>>
>> >>>>>>> On 26 Sep 2018, at 07:59, airflowuser <airflowuser@xxxxxxxxxxxxxx.INVALID> wrote:
>> >>>>>>>
>> >>>>>>> One of the most annoying, hard to understand and against all
>> >>>>>>> common sense things is the execution_date behavior. I assume
>> >>>>>>> that any new Airflow user has been struggling with it.
>> >>>>>>> The number of questions with answers referring to:
>> >>>>>>> https://airflow.apache.org/scheduler.html?scheduling-triggers
>> >>>>>>> is uncountable.
>> >>>>>>>
>> >>>>>>> Most people mistakenly think that execution_date is the
>> >>>>>>> datetime at which the DAG started to run.
>> >>>>>>>
>> >>>>>>> I suggest the following changes:
>> >>>>>>> 1. Renaming the execution_date to something else like:
>> >>>>>>> run_stamped. This name won't cause people to get confused.
>> >>>>>>> 2. Adding a new variable which indicated the actual datetime
>> >>>>>>> when the DAG run was generated. call it execution_start_date.
>> >>>>>>> People seem to want the information when the DAG actually
>> >>>>>>> started to be executed/run.
>> >>>>>>>
>> >>>>>>> These are only naming changes. No need to actually change the
>> >>>>>>> behavior - This will only make things simpler, as when a user
>> >>>>>>> encounters run_stamped he won't be confused by the name the
>> >>>>>>> way he is by execution_date
>> >>>>>>
>> >>>>>
>> >>>
>> >>
>>