Re: execution_date - can we stop the confusion?


Yep.
Aliasing seems a reasonable solution that preserves the structure and makes things simpler for new users.

While I agree with everyone that learning a new technology has a learning curve, we can still see more and more technologies embracing the user-friendly flag.
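
To make the aliasing idea concrete, here is a minimal sketch in plain Python. It is not Airflow code: the class RunInterval is hypothetical, and period_start and period_end are the names proposed in the message quoted below. Only execution_date and next_execution_date correspond to names that already exist.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class RunInterval:
    """Hypothetical stand-in for the dates attached to one DAG run."""

    execution_date: datetime       # left bound of the interval (existing name)
    next_execution_date: datetime  # right bound of the interval (existing name)

    @property
    def period_start(self) -> datetime:
        # Proposed alias: same value as execution_date, under a clearer name.
        return self.execution_date

    @property
    def period_end(self) -> datetime:
        # Proposed alias: same value as next_execution_date.
        return self.next_execution_date


run = RunInterval(
    execution_date=datetime(2018, 9, 26),
    next_execution_date=datetime(2018, 9, 26) + timedelta(days=1),
)
print(run.period_start, run.period_end)
```

Because the aliases only read the existing fields, code that already uses execution_date would keep working unchanged.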


‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Saturday, September 29, 2018 9:47 AM, <ash@xxxxxxxxxx> wrote:

> What about aliasing execution_date to period_start, and next_execution_date to period_end? Would this help at all, do we think?
>
> (Though things like ds and ts might still be confusing? This is probably where the OP got the idea for run_stamped from? One step at a time.)
>
> Ash
>
> On 27 September 2018 20:42:07 BST, George Leslie-Waksman waksman@xxxxxxxxx wrote:
>
> > I would like to challenge the notion that "execution_date" is well
> > documented. Looking at airflow.apache.org right now and searching for all
> > references to "execution_date", I find that the only definition of
> > execution_date is, "The execution date of the DAG". There are some other
> > passing references that imply more but nothing explicit.
> >
> > From the documentation, as currently published, it seems reasonable to
> > expect some concurrence between "execution_date" and when a dag executes,
> > especially given the heavy repetition of, "execution_date - The execution
> > date of the DAG".
> >
> > Personally, I think the problem is the word "execution", not with which
> > bound is used to label/define an interval. I think this is especially
> > difficult for people coming to Airflow with a cron background who are not
> > necessarily thinking about intervals.
> >
> > On Thu, Sep 27, 2018 at 11:23 AM Brian Greene <
> > brian@xxxxxxxxxxxxxxxxxxxxxxxxx> wrote:
> >
> > > Second use of “inane” on this subject. Brilliant, less combative
> > > response Chris.
> > >
> > > There’s another point.. left bound makes sense to some people, right
> > > bound to others.
> > >
> > > There’s no way to know or measure how “hard” this is to new users, so
> > > even if the change was made - new name, use right bound... how can you be
> > > sure you’re not actually confusing a LARGER number of new users from that
> > > point on?
> > >
> > > It’s like left handed versus right handed people, except there’s no
> > > statistical basis for your argument that one group is larger than the
> > > other, or that there would actually be a measurable uptick in
> > > understanding and usability across the ENTIRE user community.
> > >
> > > So your proposal 100% breaks backwards compatibility of code AND concept,
> > > on anecdotal evidence that it would somehow make usage magically easier?
> > >
> > > Airflow is like a bulldozer made out of scalpels that can fly (not well,
> > > but it’s possible). A slick dag can accomplish a staggering amount of
> > > work with the smallest little bit of elegant code. Learning to “think in
> > > airflow” though is so, so much more than understanding execution date.
> > > That’s barely table stakes in terms of concepts you’ll need to accept to
> > > be effective with airflow.
> > >
> > > Maybe somebody just has a thing against lefties? Some kind of
> > > left-bound-thinking conspiracy?
> > >
> > > Sent from a device with less than stellar autocorrect
> > >
> > > > On Sep 27, 2018, at 12:56 PM, Chris Palmer chris@xxxxxxxxxxxx wrote:
> > > >
> > > > While taking a step back makes some sense, we also need to identify what
> > > > the issue is. Simply saying 'execution_date behavior is confusing to new
> > > > users' isn't good enough. What is confusing about it? Is it what it
> > > > represents, or just the name itself?
> > > >
> > > > There are a number of different timestamps that might be of interest,
> > > > including (but not limited to):
> > > >
> > > > Identifying timestamp
> > > > For any time interval, there are two natural choices of timestamps to
> > > > represent that interval, the left and right bounds. For Airflow the left
> > > > bound has been chosen, and is called execution_date. For various reasons,
> > > > I think that makes a much better choice than the right bound.
> > > >
> > > > Create/update/delete timestamps
> > > > Timestamps representing when particular database records were created,
> > > > updated and/or deleted. I don't believe that Airflow currently records
> > > > these.
> > > >
> > > > Runtime timestamps
> > > > The timestamps that a task or other process started and stopped. Airflow
> > > > records these for Tasks, but I think the implementation is maybe a little
> > > > lacking for DagRuns.
> > > >
> > > > So what's the confusion with execution_date? Is it what it represents or
> > > > the name itself?
> > > >
> > > > I think part of the learning curve with Airflow is understanding that
> > > > execution_date is the left bound of the interval. No matter what name you
> > > > use for the identifying timestamp I think new users will need to learn
> > > > what that choice means. Changing the name won't magically make all the
> > > > confusion go away.
> > > >
> > > > While I don't think execution_date is the greatest name in the world,
> > > > it's a lot better than the suggested alternative run_stamped. Tasks also
> > > > have an identifying timestamp, and if I saw run_stamped on a Task I would
> > > > have no idea what it means (stamped by what?).
> > > >
> > > > While there may be better names than execution_date, I don't think they
> > > > are so much better that it is worth the effort to overhaul such an
> > > > integral part of Airflow. Maybe some improvements to the documentation
> > > > could be made, but nothing so drastic as renaming such a core item.
> > > >
> > > > As for the second suggestion to add "a new variable which indicated the
> > > > actual datetime when the DAG run was generated. call it
> > > > execution_start_date". It is very unclear what the desired outcome is
> > > > with this.
> > > >
> > > > To me "generated" implies creation time, i.e. recorded in the database.
> > > > However, creation of a DagRun record in the database is a distinct event
> > > > from when Tasks associated with that DagRun start executing. Plus DagRuns
> > > > themselves don't actually "run" - Tasks are the only thing that really
> > > > gets run by Airflow.
> > > >
> > > > What is actually desired here?
> > > >
> > > > -   The right bound of the schedule interval?
> > > > -   The time the DagRun was created?
> > > > -   The time that any Tasks associated with a DagRun were first
> > > >     considered by the scheduler?
> > > > -   The time that any Tasks associated with a DagRun were first
> > > >     scheduled?
> > > > -   The time that any Tasks associated with a DagRun were actually
> > > >     started by a worker?
> > > >
> > > > The lack of clarity and completeness around these suggestions, alongside
> > > > inane declarations like "This name won't cause people to get confused" is
> > > > hardly a good way to get people to take suggestions seriously.
> > > >
> > > > Chris
> > > >
> > > > On Wed, Sep 26, 2018 at 7:37 PM George Leslie-Waksman <waksman@xxxxxxxxx> wrote:
> > > >
> > > > > This comes up a lot. I've seen it on this mailing list multiple times
> > > > > and it's something that I have to explicitly call out to every single
> > > > > person that I've helped train up on Airflow.
> > > > >
> > > > > If we take a moment to set aside why things are the way they are, what
> > > > > the documentation says, and how experienced users feel things should
> > > > > behave; there still remains the fact that a lot of new users get
> > > > > confused by how "execution_date" works.
> > > > >
> > > > > Whether it's a problem, whether we need to do something, and what we
> > > > > could do are all separate questions but I think it's important that we
> > > > > acknowledge and start from:
> > > > >
> > > > > A lot of new users get confused by how "execution_date" works.
> > > > >
> > > > > I recognize that some of this is a learning curve issue and some of
> > > > > this is a mindset issue but it begs the question: do enough users
> > > > > benefit from the current structure to justify the harm to new users?
> > > > >
> > > > > --George
> > > > >
> > > > > On Wed, Sep 26, 2018 at 1:40 PM Brian Greene <
> > > > > brian@xxxxxxxxxxxxxxxxxxxxxxxxx> wrote:
> > > > >
> > > > > > It took a minute to grok, but in the larger context of how af works
> > > > > > it makes perfect sense the way it is. Changing something so
> > > > > > fundamentally breaking to every dag in existence should bring a
> > > > > > comparable benefit. Beyond avoiding teaching a concept you disagree
> > > > > > with, what benefits does the proposal bring to offset the cost of
> > > > > > change?
> > > > > >
> > > > > > I’m gonna make a meme - “do you even airflow bro?”
> > > > > >
> > > > > > Sent from a device with less than stellar autocorrect
> > > > > >
> > > > > > > On Sep 26, 2018, at 8:33 AM, Maxime Beauchemin <
> > > > > > > maximebeauchemin@xxxxxxxxx> wrote:
> > > > > > > I think if you have a functional mindset (as in "functional data
> > > > > > > engineering
> > > > > > > <https://medium.com/@maximebeauchemin/functional-data-engineering-a-modern-paradigm-for-batch-data-processing-2327ec32c42a>")
> > > > > > > as opposed to a cron mindset, using the left bound of the time
> > > > > > > interval makes a lot of sense. Things like your daily table partition
> > > > > > > keys align with your Airflow execution_date.
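
A small sketch of the partition-key point, in plain Python rather than Airflow itself. The helper partition_key and the YYYY-MM-DD key format are assumptions made for illustration; the idea is only that a daily run labelled with the left bound of its interval lines up with the partition for the day whose data it processes.

```python
from datetime import datetime, timedelta


def partition_key(execution_date: datetime) -> str:
    """Hypothetical helper: daily partition key derived from the left bound."""
    return execution_date.strftime("%Y-%m-%d")


# A daily interval covering all of 2018-09-26.
execution_date = datetime(2018, 9, 26)             # left bound
interval_end = execution_date + timedelta(days=1)  # right bound, 2018-09-27

# The run processes data from 2018-09-26 and the key says 2018-09-26,
# even though the work cannot start before interval_end.
key = partition_key(execution_date)
print(key)
```

Labelling the run with the right bound instead would produce a key one day later than the data it holds.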
> > > > > > >
> > > > > > > The main thing is that whatever we do we cannot break backwards
> > > > > > > compatibility. Offering both views (left bound/right bound), as it's
> > > > > > > been proposed before, either as an environment setting or a user
> > > > > > > personal preference is even more confusing to me personally. Users
> > > > > > > would have to switch context as they help each other or change
> > > > > > > environments.
> > > > > > >
> > > > > > > Also note that your intuition may differ from other people's
> > > > > > > intuition, and that "unlearning" something is way harder than
> > > > > > > learning something.
> > > > > > >
> > > > > > > My personal take on this is to make this a rite of passage. This is
> > > > > > > just one of the many things you have to learn when learning Airflow.
> > > > > > >
> > > > > > > Max
> > > > > > >
> > > > > > > > On Wed, Sep 26, 2018 at 8:18 AM Sam Elamin hussam.elamin@xxxxxxxxx wrote:
> > > > > > > >
> > > > > > > > Hi Bolke
> > > > > > > >
> > > > > > > > Speaking as a consultant who is constantly training other teams how
> > > > > > > > to use airflow, I do frequently see this confusion.
> > > > > > > >
> > > > > > > > Another one is how the run_date is always batch_date + interval or,
> > > > > > > > as the docs make quite clear:
> > > > > > > >
> > > > > > > > "Let’s Repeat That The scheduler runs your job one schedule_interval
> > > > > > > > AFTER the start date, at the END of the period."
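
The quoted sentence from the docs can be worked through with a few lines of plain Python. This is only a sketch of the arithmetic, not the scheduler's implementation; earliest_run_time is a hypothetical helper and the dates are made up for the example.

```python
from datetime import datetime, timedelta


def earliest_run_time(execution_date: datetime, schedule_interval: timedelta) -> datetime:
    """Hypothetical helper: a run labelled execution_date can start only
    once its interval has ended, i.e. one schedule_interval later."""
    return execution_date + schedule_interval


start_date = datetime(2018, 9, 26)
daily = timedelta(days=1)

# The first run is labelled with start_date but starts a full interval later.
first_run_starts = earliest_run_time(start_date, daily)
print(start_date.date(), "->", first_run_starts.date())
```

So with a daily schedule, the run whose execution_date is 2018-09-26 starts no earlier than 2018-09-27.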
> > > > > > > >
> > > > > > > > Renaming it would make it simpler for newbies, but essentially they
> > > > > > > > will need to understand how Airflow behaves, execution_date being
> > > > > > > > the batch execution date, not the run_date of the DAG.
> > > > > > > >
> > > > > > > > I am actually in the process of writing a blog post
> > > > > > > > <https://samelamin.github.io/2017/04/27/Building-A-Datapipeline-part1/>
> > > > > > > > about this, on which I could use people's feedback.
> > > > > > > >
> > > > > > > > If it helps, I find that explaining how backfills work and why they
> > > > > > > > are important will drive home what the execution_date is :)
> > > > > > > >
> > > > > > > > Regards
> > > > > > > > Sam
> > > > > > > >
> > > > > > > > > On Wed, Sep 26, 2018 at 4:10 PM Bolke de Bruin bdbruin@xxxxxxxxx wrote:
> > > > > > > > >
> > > > > > > > > I don't think this makes sense and I don't think that anyone had a
> > > > > > > > > real issue with this. Execution date has been clearly documented
> > > > > > > > > and is part of the core principles of airflow. Renaming will
> > > > > > > > > create more confusion.
> > > > > > > > >
> > > > > > > > > Please note that I do think that as an anonymous user you cannot
> > > > > > > > > speak for any "new airflow user". That is a contradiction to me.
> > > > > > > > >
> > > > > > > > > Thanks
> > > > > > > > > Bolke
> > > > > > > > >
> > > > > > > > > Sent from my iPhone
> > > > > > > > >
> > > > > > > > > > On 26 Sep 2018, at 07:59, airflowuser <airflowuser@xxxxxxxxxxxxxx.INVALID> wrote:
> > > > > > > > > >
> > > > > > > > > > One of the most annoying and hard-to-understand things, and one
> > > > > > > > > > that goes against all common sense, is the execution_date
> > > > > > > > > > behavior. I assume that any new Airflow user has been struggling
> > > > > > > > > > with it.
> > > > > > > > > >
> > > > > > > > > > The number of questions with answers referring to
> > > > > > > > > > https://airflow.apache.org/scheduler.html?scheduling-triggers
> > > > > > > > > > is uncountable.
> > > > > > > > > >
> > > > > > > > > > Most people mistakenly think that execution_date is the datetime
> > > > > > > > > > at which the DAG started to run.
> > > > > > > > > >
> > > > > > > > > > I suggest the following changes:
> > > > > > > > > >
> > > > > > > > > > 1.  Renaming the execution_date to something else like:
> > > > > > > > > >     run_stamped. This name won't cause people to get confused.
> > > > > > > > > > 2.  Adding a new variable which indicated the actual datetime
> > > > > > > > > >     when the DAG run was generated. call it execution_start_date.
> > > > > > > > > >     People seem to want the information when the DAG actually
> > > > > > > > > >     started to be executed/run.
> > > > > > > > > >
> > > > > > > > > > This is only naming changes. No need to actually change the
> > > > > > > > > > behavior. This will only make things simpler: when a user
> > > > > > > > > > encounters run_stamped, they won't be confused by the name the
> > > > > > > > > > way they are by execution_date.