Re: execution_date - can we stop the confusion?


Changing terms or aliasing may both introduce another set of confusions.

Refining the documentation systematically may be a more feasible solution
to this sort of issue. For example, by covering “execution_date” in the
“Concepts” section, or by adding a dedicated section named “Vocabularies”
that lists all potentially confusing terms?

Thanks.

XD
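
For readers following the thread below, the interval semantics under discussion can be sketched in a few lines of plain Python. This is only an illustration, not Airflow code: `SchedulePeriod` and `earliest_run_time` are hypothetical names, the schedule interval is assumed to be a fixed `timedelta`, and `period_start` / `period_end` are the aliases proposed later in the thread rather than existing Airflow variables.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class SchedulePeriod:
    """One schedule interval, identified by its left bound (hypothetical helper)."""

    execution_date: datetime  # left bound: the label given to the run
    interval: timedelta       # assumed fixed-length schedule interval

    @property
    def next_execution_date(self) -> datetime:
        """Right bound of the interval."""
        return self.execution_date + self.interval

    @property
    def period_start(self) -> datetime:
        """Alias proposed in the thread for execution_date."""
        return self.execution_date

    @property
    def period_end(self) -> datetime:
        """Alias proposed in the thread for next_execution_date."""
        return self.next_execution_date

    @property
    def earliest_run_time(self) -> datetime:
        """A run can only start once the period it covers has ended."""
        return self.period_end


daily = SchedulePeriod(datetime(2018, 9, 26), timedelta(days=1))
print(daily.period_start)       # 2018-09-26 00:00:00
print(daily.earliest_run_time)  # 2018-09-27 00:00:00
```

Under these assumptions, a daily run labelled 2018-09-26 starts no earlier than 2018-09-27, which is the gap between the label and the wall-clock start time that the thread keeps returning to.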


On Mon, Oct 1, 2018 at 23:51 Maxime Beauchemin <maximebeauchemin@xxxxxxxxx>
wrote:

> I'm not against aliasing personally.
>
> The downside is that it creates more vocabulary overall and most users will
> need to learn the mapping of the given aliases at some point in their
> learning curve anyway. Only users in environments free of `execution_date`
> will benefit from less confusion, and it's likely that the pre-aliased
> terms will live on in perpetuity (habit + legacy code).
>
> I'm assuming that the scope of the aliasing would be BaseOperator, the
> tutorial, examples, the web UI and CLI. If we start using `period_start` in
> those user-facing locations, it creates a bit of a dissonance with the
> object naming in the code base and database. Contributors will really need
> to understand that aliasing, with `period_start` and `execution_date`
> potentially being used interchangeably in the codebase.
>
> I don't think anyone is pushing for this, but I feel strongly that any
> campaign to deprecate the original interface would be a giant waste of
> effort and time and alienate the community as a whole.
>
> Max
>
> On Sun, Sep 30, 2018 at 1:15 AM airflowuser
> <airflowuser@xxxxxxxxxxxxxx.invalid> wrote:
>
> > Yep.
> > Aliasing seems a reasonable solution that preserves the structure and
> > makes things simpler for new users.
> >
> > While I agree with everyone that learning a new technology has a learning
> > curve, we still see more and more technologies embracing the user-friendly
> > flag.
> >
> >
> > Sent with ProtonMail Secure Email.
> >
> > ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> > On Saturday, September 29, 2018 9:47 AM, <ash@xxxxxxxxxx> wrote:
> >
> > > What about (aliasing) execution_date to period_start, and
> > > next_execution_date to period_end? Would this help any do we think?
> > > >
> > > > (Though things like ds and ts might still be confusing? This is
> > > > probably where the OP got the idea for run_stamped from? One step at a
> > > > time.)
> > >
> > > Ash
> > >
> > > > On 27 September 2018 20:42:07 BST, George Leslie-Waksman
> > > > waksman@xxxxxxxxx wrote:
> > > >
> > > > > I would like to challenge the notion that "execution_date" is well
> > > > > documented. Looking at airflow.apache.org right now and searching
> > > > > for all references to "execution_date", I find that the only
> > > > > definition of execution_date is, "The execution date of the DAG".
> > > > > There are some other passing references that imply more but nothing
> > > > > explicit.
> > > > >
> > > > > From the documentation, as currently published, it seems reasonable
> > > > > to expect some concurrence between "execution_date" and when a dag
> > > > > executes, especially given the heavy repetition of, "execution_date
> > > > > - The execution date of the DAG".
> > > > >
> > > > > Personally, I think the problem is the word "execution", not with
> > > > > which bound is used to label/define an interval. I think this is
> > > > > especially difficult for people coming to Airflow with a cron
> > > > > background who are not necessarily thinking about intervals.
> > > > >
> > > > On Thu, Sep 27, 2018 at 11:23 AM Brian Greene <
> > > > brian@xxxxxxxxxxxxxxxxxxxxxxxxx> wrote:
> > > >
> > > > > > Second use of “inane” on this subject. Brilliant, less combative
> > > > > > response, Chris.
> > > > > >
> > > > > > There’s another point: left bound makes sense to some people,
> > > > > > right bound to others.
> > > > > >
> > > > > > There’s no way to know or measure how “hard” this is to new users,
> > > > > > so even if the change was made (new name, use right bound), how
> > > > > > can you be sure you’re not actually confusing a LARGER number of
> > > > > > new users from that point on?
> > > > > >
> > > > > > It’s like left-handed versus right-handed people, except there’s
> > > > > > no statistical basis for your argument that one group is larger
> > > > > > than the other, or that there would actually be a measurable
> > > > > > uptick in understanding and usability across the ENTIRE user
> > > > > > community.
> > > > > >
> > > > > > So your proposal 100% breaks backwards compatibility of code AND
> > > > > > concept, on anecdotal evidence that it would somehow make usage
> > > > > > magically easier?
> > > > > >
> > > > > > Airflow is like a bulldozer made out of scalpels that can fly (not
> > > > > > well, but it’s possible). A slick dag can accomplish a staggering
> > > > > > amount of work with the smallest little bit of elegant code.
> > > > > > Learning to “think in airflow” though is so, so much more than
> > > > > > understanding execution date. That’s barely table stakes in terms
> > > > > > of concepts you’ll need to accept to be effective with airflow.
> > > > > >
> > > > > > Maybe somebody just has a thing against lefties? Some kind of
> > > > > > left-bound-thinking conspiracy?
> > > > > >
> > > > > > Sent from a device with less than stellar autocorrect
> > > > >
> > > > > > On Sep 27, 2018, at 12:56 PM, Chris Palmer chris@xxxxxxxxxxxx
> > > > > > wrote:
> > > > >
> > > > > > > While taking a step back makes some sense, we also need to
> > > > > > > identify what the issue is. Simply saying 'execution_date
> > > > > > > behavior is confusing to new users' isn't good enough. What is
> > > > > > > confusing about it? Is it what it represents, or just the name
> > > > > > > itself?
> > > > > > >
> > > > > > > There are a number of different timestamps that might be of
> > > > > > > interest, including (but not limited to):
> > > > > > >
> > > > > > > Identifying timestamp
> > > > > > > For any time interval, there are two natural choices of
> > > > > > > timestamps to represent that interval, the left and right
> > > > > > > bounds. For Airflow the left bound has been chosen, and is
> > > > > > > called execution_date. For various reasons, I think that makes a
> > > > > > > much better choice than the right bound.
> > > > > > >
> > > > > > > Create/update/delete timestamps
> > > > > > > Timestamps representing when particular database records were
> > > > > > > created, updated and/or deleted. I don't believe that Airflow
> > > > > > > currently records these.
> > > > > > >
> > > > > > > Runtime timestamps
> > > > > > > The timestamps that a task or other process started and stopped.
> > > > > > > Airflow records these for Tasks, but I think the implementation
> > > > > > > is maybe a little lacking for DagRuns.
> > > > > > >
> > > > > > > So what's the confusion with execution_date? Is it what it
> > > > > > > represents or the name itself?
> > > > > > >
> > > > > > > I think part of the learning curve with Airflow is understanding
> > > > > > > that execution_date is the left bound of the interval. No matter
> > > > > > > what name you use for the identifying timestamp I think new
> > > > > > > users will need to learn what that choice means. Changing the
> > > > > > > name won't magically make all the confusion go away.
> > > > > > >
> > > > > > > While I don't think execution_date is the greatest name in the
> > > > > > > world, it's a lot better than the suggested alternative
> > > > > > > run_stamped. Tasks also have an identifying timestamp, and if I
> > > > > > > saw run_stamped on a Task I would have no idea what it means
> > > > > > > (stamped by what?).
> > > > > > >
> > > > > > > While there may be better names than execution_date, I don't
> > > > > > > think they are so much better that it is worth the effort to
> > > > > > > overhaul such an integral part of Airflow. Maybe some
> > > > > > > improvements to the documentation could be made, but nothing so
> > > > > > > drastic as renaming such a core item.
> > > > > > >
> > > > > > > As for the second suggestion to add "a new variable which
> > > > > > > indicated the actual datetime when the DAG run was generated.
> > > > > > > call it execution_start_date". It is very unclear what the
> > > > > > > desired outcome is with this.
> > > > > > >
> > > > > > > To me "generated" implies creation time, i.e. recorded in the
> > > > > > > database. However, creation of a DagRun record in the database
> > > > > > > is a distinct event from when Tasks associated with that DagRun
> > > > > > > start executing. Plus DagRuns themselves don't actually "run" -
> > > > > > > Tasks are the only thing that really gets run by Airflow.
> > > > > > >
> > > > > > > What is actually desired here?
> > > > > > >
> > > > > > > -   The right bound of the schedule interval?
> > > > > > > -   The time the DagRun was created?
> > > > > > > -   The time that any Tasks associated with a DagRun were first
> > > > > > >     considered by the scheduler?
> > > > > > > -   The time that any Tasks associated with a DagRun were first
> > > > > > >     scheduled?
> > > > > > > -   The time that any Tasks associated with a DagRun were
> > > > > > >     actually started by a worker?
> > > > > > >
> > > > > > > The lack of clarity and completeness around these suggestions,
> > > > > > > alongside inane declarations like "This name won't cause people
> > > > > > > to get confused" is hardly a good way to get people to take
> > > > > > > suggestions seriously.
> > > > > > >
> > > > > > > Chris
> > > > > > >
> > > > > > > On Wed, Sep 26, 2018 at 7:37 PM George Leslie-Waksman
> > > > > > > <waksman@xxxxxxxxx> wrote:
> > > > > >
> > > > > > > > This comes up a lot. I've seen it on this mailing list
> > > > > > > > multiple times and it's something that I have to explicitly
> > > > > > > > call out to every single person that I've helped train up on
> > > > > > > > Airflow.
> > > > > > > >
> > > > > > > > If we take a moment to set aside why things are the way they
> > > > > > > > are, what the documentation says, and how experienced users
> > > > > > > > feel things should behave; there still remains the fact that a
> > > > > > > > lot of new users get confused by how "execution_date" works.
> > > > > > > >
> > > > > > > > Whether it's a problem, whether we need to do something, and
> > > > > > > > what we could do are all separate questions but I think it's
> > > > > > > > important that we acknowledge and start from:
> > > > > > > >
> > > > > > > > A lot of new users get confused by how "execution_date" works.
> > > > > > > >
> > > > > > > > I recognize that some of this is a learning curve issue and
> > > > > > > > some of this is a mindset issue but it begs the question: do
> > > > > > > > enough users benefit from the current structure to justify the
> > > > > > > > harm to new users?
> > > > > > > >
> > > > > > > > --George
> > > > > > > >
> > > > > > > On Wed, Sep 26, 2018 at 1:40 PM Brian Greene <
> > > > > > > brian@xxxxxxxxxxxxxxxxxxxxxxxxx> wrote:
> > > > > > >
> > > > > > > > > It took a minute to grok, but in the larger context of how
> > > > > > > > > af works it makes perfect sense the way it is. Changing
> > > > > > > > > something so fundamentally breaking to every dag in
> > > > > > > > > existence should bring a comparable benefit. Beyond
> > > > > > > > > avoiding teaching a concept you disagree with, what
> > > > > > > > > benefits does the proposal bring to offset the cost of
> > > > > > > > > change?
> > > > > > > > >
> > > > > > > > > I’m gonna make a meme - “do you even airflow bro?”
> > > > > > > > >
> > > > > > > > > Sent from a device with less than stellar autocorrect
> > > > > > > >
> > > > > > > > > On Sep 26, 2018, at 8:33 AM, Maxime Beauchemin <
> > > > > > > > > maximebeauchemin@xxxxxxxxx> wrote:
> > > > > > > > > > I think if you have a functional mindset (as in
> > > > > > > > > > "functional data engineering
> > > > > > > > > > <https://medium.com/@maximebeauchemin/functional-data-engineering-a-modern-paradigm-for-batch-data-processing-2327ec32c42a>")
> > > > > > > > > > as opposed to a cron mindset, using the left bound of the
> > > > > > > > > > time interval makes a lot of sense. Things like your daily
> > > > > > > > > > table partition keys align with your Airflow
> > > > > > > > > > execution_date.
> > > > > > > > > >
> > > > > > > > > > The main thing is that whatever we do we cannot break
> > > > > > > > > > backwards compatibility. Offering both views (left
> > > > > > > > > > bound/right bound), as it's been proposed before, either
> > > > > > > > > > as an environment setting or a user personal preference is
> > > > > > > > > > even more confusing to me personally. Users would have to
> > > > > > > > > > switch context as they help each other or change
> > > > > > > > > > environments.
> > > > > > > > > >
> > > > > > > > > > Also note that your intuition may differ from other
> > > > > > > > > > people's intuition, and that "unlearning" something is way
> > > > > > > > > > harder than learning something.
> > > > > > > > > >
> > > > > > > > > > My personal take on this is to make this a rite of
> > > > > > > > > > passage. This is just one of the many things you have to
> > > > > > > > > > learn when learning Airflow.
> > > > > > > > > >
> > > > > > > > > > Max
> > > > > > > > >
> > > > > > > > > > > On Wed, Sep 26, 2018 at 8:18 AM Sam Elamin
> > > > > > > > > > > hussam.elamin@xxxxxxxxx wrote:
> > > > > > > > > > >
> > > > > > > > > > > Hi Bolke
> > > > > > > > > > >
> > > > > > > > > > > Speaking as a consultant who is constantly training
> > > > > > > > > > > other teams how to use airflow, I do frequently see this
> > > > > > > > > > > confusion.
> > > > > > > > > > >
> > > > > > > > > > > Another one is how the batch_date is always batch_date +
> > > > > > > > > > > interval, or, as the docs make quite clear:
> > > > > > > > > > >
> > > > > > > > > > > "Let’s Repeat That The scheduler runs your job one
> > > > > > > > > > > schedule_interval AFTER the start date, at the END of
> > > > > > > > > > > the period."
> > > > > > > > > > >
> > > > > > > > > > > Renaming it would make it simpler for newbies, but
> > > > > > > > > > > essentially they will need to understand how Airflow
> > > > > > > > > > > behaves, execution_date being the batch execution date,
> > > > > > > > > > > not the run_date of the DAG.
> > > > > > > > > > >
> > > > > > > > > > > I am actually in the process of writing a blog post
> > > > > > > > > > > <https://samelamin.github.io/2017/04/27/Building-A-Datapipeline-part1/>
> > > > > > > > > > > about this, on which I could use people's feedback.
> > > > > > > > > > >
> > > > > > > > > > > If it helps, I find that explaining how backfills work
> > > > > > > > > > > and why they are important will drive home what the
> > > > > > > > > > > execution_date is :)
> > > > > > > > > > >
> > > > > > > > > > > Regards
> > > > > > > > > > > Sam
> > > > > > > > > >
> > > > > > > > > > > > On Wed, Sep 26, 2018 at 4:10 PM Bolke de Bruin
> > > > > > > > > > > > bdbruin@xxxxxxxxx wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > I don't think this makes sense and I don't think that
> > > > > > > > > > > > anyone had a real issue with this. Execution date has
> > > > > > > > > > > > been clearly documented and is part of the core
> > > > > > > > > > > > principles of airflow. Renaming will create more
> > > > > > > > > > > > confusion.
> > > > > > > > > > > >
> > > > > > > > > > > > Please note that I do think that as an anonymous user
> > > > > > > > > > > > you cannot speak for any "new airflow user". That is a
> > > > > > > > > > > > contradiction to me.
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks
> > > > > > > > > > > > Bolke
> > > > > > > > > > > >
> > > > > > > > > > > > Sent from my iPhone
> > > > > > > > > > >
> > > > > > > > > > > > > On 26 Sep 2018, at 07:59, airflowuser
> > > > > > > > > > > > > <airflowuser@xxxxxxxxxxxxxx.INVALID> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > One of the most annoying, hard to understand and
> > > > > > > > > > > > > against all common sense is the execution_date
> > > > > > > > > > > > > behavior. I assume that any new Airflow user has
> > > > > > > > > > > > > been struggling with it.
> > > > > > > > > > > > >
> > > > > > > > > > > > > The amount of questions with answers referring to:
> > > > > > > > > > > > > https://airflow.apache.org/scheduler.html?scheduling-triggers
> > > > > > > > > > > > > is uncountable.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Most people mistakenly think that execution_date is
> > > > > > > > > > > > > the datetime which the DAG started to run.
> > > > > > > > > > > > >
> > > > > > > > > > > > > I suggest the following changes:
> > > > > > > > > > > > >
> > > > > > > > > > > > > 1.  Renaming the execution_date to something else
> > > > > > > > > > > > >     like: run_stamped
> > > > > > > > > > > > >     This name won't cause people to get confused.
> > > > > > > > > > > > > 2.  Adding a new variable which indicated the actual
> > > > > > > > > > > > >     datetime when the DAG run was generated. call it
> > > > > > > > > > > > >     execution_start_date. People seem to want the
> > > > > > > > > > > > >     information when the DAG actually started to be
> > > > > > > > > > > > >     executed/run.
> > > > > > > > > > > > >
> > > > > > > > > > > > > This is only naming changes. No need to actual
> > > > > > > > > > > > > change the behavior - This will only make things
> > > > > > > > > > > > > simpler as when user encounter run_stamped he won't
> > > > > > > > > > > > > be confused by the name like execution_date
> >
> >
> >
>