Re: execution_date - can we stop the confusion?


Second use of “inane” on this subject.  Brilliant, less combative response, Chris.

There’s another point: the left bound makes sense to some people, the right bound to others.
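
To make the two conventions concrete, here is a minimal sketch in plain Python (no Airflow imports). The helper name interval_bounds and the daily interval are made up for illustration; the only thing taken from this thread is that Airflow's execution_date is the left bound of the interval.

```python
from datetime import datetime, timedelta


def interval_bounds(left_bound, interval):
    """Return the (left, right) bounds of one schedule interval.

    Hypothetical helper for illustration only; not part of Airflow.
    """
    return left_bound, left_bound + interval


# Assumed example: a daily interval covering 2018-09-26.
left, right = interval_bounds(datetime(2018, 9, 26), timedelta(days=1))

# Left-bound convention: the label this thread says execution_date uses.
print("left bound: ", left.isoformat())
# Right-bound convention: the end of the interval, i.e. the earliest point
# at which the whole interval has elapsed.
print("right bound:", right.isoformat())
```

Both values identify the same interval; the disagreement in this thread is only about which of the two should be the label.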

There’s no way to know or measure how “hard” this is for new users, so even if the change were made (new name, use right bound), how can you be sure you’re not actually confusing a LARGER number of new users from that point on?

It’s like left handed versus right handed people, except there’s no statistical basis for your argument that one group is larger than the other, or that there would actually be a measurable uptick in understanding and usability across the ENTIRE user community.

So your proposal 100% breaks backwards compatibility of code AND concept, on anecdotal evidence that it would somehow make usage magically easier?

Airflow is like a bulldozer made out of scalpels that can fly (not well, but it’s possible).  A slick dag can accomplish a staggering amount of work with the smallest little bit of elegant code.  Learning to “think in airflow” though is so, so much more than understanding execution date.  That’s barely table stakes in terms of concepts you’ll need to accept to be effective with airflow.

Maybe somebody just has a thing against lefties?  Some kind of left-bound-thinking conspiracy?

Sent from a device with less than stellar autocorrect

> On Sep 27, 2018, at 12:56 PM, Chris Palmer <chris@xxxxxxxxxxxx> wrote:
> 
> While taking a step back makes some sense, we also need to identify what
> the issue is. Simply saying 'execution_date behavior is confusing to new
> users' isn't good enough. What is confusing about it? Is it what it
> represents, or just the name itself?
> 
> There are a number of different timestamps that might be of interest,
> including (but not limited to):
> 
> *Identifying timestamp*
> For any time interval, there are two natural choices of timestamps to
> represent that interval, the left and right bounds. For Airflow the left
> bound has been chosen, and is called execution_date. For various reasons, I
> think that makes a much better choice than the right bound.
> 
> *Create/update/delete timestamps*
> Timestamps representing when particular database records were created,
> updated and/or deleted. I don't believe that Airflow currently records
> these.
> 
> *Runtime timestamps*
> The timestamps that a task or other process started and stopped. Airflow
> records these for Tasks, but I think the implementation is maybe a little
> lacking for DagRuns.
> 
> 
> So what's the confusion with execution_date? Is it what it represents or
> the name itself?
> 
> I think part of the learning curve with Airflow is understanding that
> execution_date is the left bound of the interval. No matter what name you
> use for the identifying timestamp I think new users will need to learn what
> that choice means. Changing the name won't magically make all the confusion
> go away.
> 
> While I don't think execution_date is the greatest name in the world, it's
> a lot better than the suggested alternative run_stamped. Tasks also have an
> identifying timestamp, and if I saw run_stamped on a Task I would have no
> idea what it means (stamped by what?).
> 
> While there may be better names than execution_date, I don't think they are
> so much better that it is worth the effort to overhaul such an integral
> part of Airflow. Maybe some improvements to the documentation could be
> made, but nothing so drastic as renaming such a core item.
> 
> 
> As for the second suggestion, to add "a new variable which indicated the
> actual datetime when the DAG run was generated. call it
> execution_start_date", it is very unclear what the desired outcome is.
> 
> To me "generated" implies creation time, i.e. recorded in the database.
> However, creation of a DagRun record in the database is a distinct event
> from when Tasks associated with that DagRun start executing. Plus DagRuns
> themselves don't actually "run" - Tasks are the only thing that really gets
> run by Airflow.
> 
> What is actually desired here?
> - The right bound of the schedule interval?
> - The time the DagRun was created?
> - The time that any Tasks associated with a DagRun were first considered
> by the scheduler?
> - The time that any Tasks associated with a DagRun were first scheduled?
> - The time that any Tasks associated with a DagRun were actually started
> by a worker?
> 
> 
> The lack of clarity and completeness around these suggestions, alongside
> inane declarations like "This name won't cause people to get confused" is
> hardly a good way to get people to take suggestions seriously.
> 
> Chris
> 
> 
> On Wed, Sep 26, 2018 at 7:37 PM George Leslie-Waksman <waksman@xxxxxxxxx>
> wrote:
> 
>> This comes up a lot. I've seen it on this mailing list multiple times and
>> it's something that I have to explicitly call out to every single person
>> that I've helped train up on Airflow.
>> 
>> If we take a moment to set aside why things are the way they are, what the
>> documentation says, and how experienced users feel things should behave;
>> there still remains the fact that a lot of new users get confused by how
>> "execution_date" works.
>> 
>> Whether it's a problem, whether we need to do something, and what we could
>> do are all separate questions but I think it's important that we
>> acknowledge and start from:
>> 
>> A lot of new users get confused by how "execution_date" works.
>> 
>> I recognize that some of this is a learning curve issue and some of this is
>> a mindset issue but it begs the question: do enough users benefit from the
>> current structure to justify the harm to new users?
>> 
>> --George
>> 
>> On Wed, Sep 26, 2018 at 1:40 PM Brian Greene <
>> brian@xxxxxxxxxxxxxxxxxxxxxxxxx> wrote:
>> 
>>> It took a minute to grok, but in the larger context of how af works it
>>> makes perfect sense the way it is.  Changing something so fundamentally
>>> breaking to every dag in existence should bring a comparable benefit.
>>> Beyond avoiding teaching a concept you disagree with, what benefits
>>> does the proposal bring to offset the cost of change?
>>> 
>>> I’m gonna make a meme - “do you even airflow bro?”
>>> 
>>> Sent from a device with less than stellar autocorrect
>>> 
>>>> On Sep 26, 2018, at 8:33 AM, Maxime Beauchemin <maximebeauchemin@xxxxxxxxx> wrote:
>>>> 
>>>> I think if you have a functional mindset (as in "functional data
>>>> engineering"
>>>> <https://medium.com/@maximebeauchemin/functional-data-engineering-a-modern-paradigm-for-batch-data-processing-2327ec32c42a>)
>>>> as opposed to a cron mindset, using the left bound of the time interval
>>>> makes a lot of sense. Things like your daily table partition keys align
>>>> with your Airflow execution_date.
>>>> 
>>>> The main thing is that whatever we do we cannot break backwards
>>>> compatibility. Offering both views (left bound/right bound), as it's been
>>>> proposed before, either as an environment setting or a user personal
>>>> preference is even more confusing to me personally. Users would have to
>>>> switch context as they help each other or change environments.
>>>> 
>>>> Also note that your intuition may differ from other people's intuition,
>>>> and that "unlearning" something is way harder than learning something.
>>>> 
>>>> My personal take on this is to make this a rite of passage. This is just
>>>> one of the many things you have to learn when learning Airflow.
>>>> 
>>>> Max
>>>> 
>>>>> On Wed, Sep 26, 2018 at 8:18 AM Sam Elamin <hussam.elamin@xxxxxxxxx> wrote:
>>>>> 
>>>>> Hi Bolke
>>>>> 
>>>>> Speaking as a consultant who is constantly training other teams how to
>>>>> use airflow, I do frequently see this confusion.
>>>>> Another one is how the actual run time is always batch_date + interval
>>>>> or, as the docs make quite clear:
>>>>> 
>>>>> "*Let’s Repeat That* The scheduler runs your job one schedule_interval
>>>>> AFTER the start date, at the END of the period."
>>>>> 
>>>>> Renaming it would make it simpler for newbies, but essentially they will
>>>>> need to understand how Airflow behaves: execution_date is the batch
>>>>> execution date, not the run_date of the DAG.
>>>>> 
>>>>> I am actually in the process of writing a blog post
>>>>> <https://samelamin.github.io/2017/04/27/Building-A-Datapipeline-part1/>
>>>>> about this, on which I could use people's feedback.
>>>>> 
>>>>> If it helps, I find that explaining how backfills work and why they are
>>>>> important will drive home what the execution_date is :)
>>>>> 
>>>>> 
>>>>> Regards
>>>>> Sam
>>>>> 
>>>>> 
>>>>> 
>>>>>> On Wed, Sep 26, 2018 at 4:10 PM Bolke de Bruin <bdbruin@xxxxxxxxx> wrote:
>>>>>> 
>>>>>> I don't think this makes sense and I don't think that anyone had a real
>>>>>> issue with this. Execution date has been clearly documented and is part
>>>>>> of the core principles of airflow. Renaming will create more confusion.
>>>>>> 
>>>>>> Please note that I do think that as an anonymous user you cannot speak
>>>>>> for any "new airflow user". That is a contradiction to me.
>>>>>> 
>>>>>> Thanks
>>>>>> Bolke
>>>>>> 
>>>>>> Sent from my iPhone
>>>>>> 
>>>>>>> On 26 Sep 2018, at 07:59, airflowuser <airflowuser@xxxxxxxxxxxxxx.INVALID> wrote:
>>>>>>> 
>>>>>>> One of the most annoying, hard-to-understand behaviors, and one that
>>>>>>> goes against all common sense, is the execution_date behavior. I assume
>>>>>>> that any new Airflow user has been struggling with it.
>>>>>>> The number of questions with answers referring to
>>>>>>> https://airflow.apache.org/scheduler.html?scheduling-triggers is
>>>>>>> uncountable.
>>>>>>> 
>>>>>>> Most people mistakenly think that execution_date is the datetime at
>>>>>>> which the DAG started to run.
>>>>>>> 
>>>>>>> I suggest the following changes:
>>>>>>> 1. Renaming the execution_date to something else like: run_stamped
>>>>>>> This name won't cause people to get confused.
>>>>>>> 2. Adding a new variable which indicated the actual datetime when the
>>>>>>> DAG run was generated. call it execution_start_date. People seem to want
>>>>>>> the information when the DAG actually started to be executed/run.
>>>>>>> 
>>>>>>> These are only naming changes. There is no need to actually change the
>>>>>>> behavior. This will only make things simpler: when a user encounters
>>>>>>> run_stamped, they won't be confused by the name the way they are by
>>>>>>> execution_date.
>>>>>> 
>>>>> 
>>> 
>>