git.net

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Fundamental change - Separate DAG name and id.


Exactly that Ash! This awesome community never fails to impress. Thanks
folks!

On Tue, Sep 25, 2018 at 3:29 AM Ash Berlin-Taylor <ash@xxxxxxxxxx> wrote:

>
> > On 24 Sep 2018, at 23:12, Alex Tronchin-James 949-412-7220
> <(949)%20412-7220> <alex.n.james@xxxxxxxxx> wrote:
> >
> > Re: [Brian Greene] "How does filename matter?  Frankly I wish the
> filename
> > was REQUIRED to be the dag name so people would quit confusing themselves
> > by mismatching them !"
> >
> > FWIW in the Facebook predecessor to airflow, the file path/name WAS the
> dag
> > name. E.g. if your dag resided in best_team/new_project/sweet_dag.py then
> > the dag name would be best_team.new_project.sweet_dag
> > All tasks were identified by their variable name after that prefix: E.g.
> if
> > best_team.new_project.sweet_dag defines an operator in a variable named
> > task1, then the respective task_id is
> best_team.new_project.sweet_dag.task1.
> >
> > Airflow provides additional flexibility to specify DAG and task names to
> > avoid the sometimes annoyingly long task names this resulted in and allow
> > DAG/task names without forcing a code directory structure and python's
> > variable naming restrictions, and I think this is a Good Thing.
> >
> > It seems like airflowuser is trying to provide additional metadata beyond
> > the DAG/task names (so far, a DAG 'title' distinct from the ID). I've
> > provided this through a README.md included in the DAG source directory,
> but
> > maybe it would be a win to instead add a DAG parameter named 'readme' of
> > string type which can include a docstring or even markdown to provide any
> > desired additional metadata? This could then be displayed by the UI to
> > simplify access to any such provided DAG documentation.
>
> You mean like https://airflow.apache.org/concepts.html#documentation-notes
> <https://airflow.apache.org/concepts.html#documentation-notes> ? ✨
> >
> > 🍿
> >
> >
> >
> > On Thu, Sep 20, 2018 at 10:45 PM Brian Greene <
> > brian@xxxxxxxxxxxxxxxxxxxxxxxxx> wrote:
> >
> >> Prior to using airflow for much, on first inspection, I think I may have
> >> agreed with you.
> >>
> >> After a bit of use I’d agree with Fokko and others - this isn’t really a
> >> problem, and separating them seems to do more harm than good related to
> >> deployment.
> >>
> >> I was gonna stop there, but why?
> >>
> >> You can add a task to a dag that’s deployed and has run and still view
> >> history.  The “new” task shows up white Squares in the old dags.  nobody
> >> said you’re required to also rename the dag when you do so this.  If
> your
> >> process or desire or design determines you need to rename it, well then
> by
> >> definition... isn’t it a new thing without a history?  Airflow is
> >> implementing exactly that.
> >>
> >> One could argue that renaming to reflect exact purpose is good practice.
> >> Yes, I’d agree, but again following that logic if it’s a small enough
> >> change to “slip in” then the name likely shouldn’t change.  If it’s big
> >> enough I want to change the name then it’s a big enough change that I’m
> >> functionally running something “new”, and I expect to need to account
> for
> >> that.  Airflow is enforcing that logic by coupling the name to the
> >> deployment of what you said was a new process.
> >>
> >> One might put forth that changing the name to be more descriptive In the
> >> ui makes it easier for support staff.  I think perhaps if that’s your
> >> challenge it’s not airflow that’s a problem.  Dags are of course
> documented
> >> elsewhere besides their name, right?  Yeah it’s self documenting (and
> the
> >> graphs are cool), but I have to assume there’s something besides the
> NAME
> >> to tell people what it does.  Additionally, far more than the name is
> >> required for even an operator or monitor watcher to take action - you
> don’t
> >> expect them to know which tasks to rerun or how to troubleshoot failures
> >> just based on your “now most descriptive name in the UI” do you?
> >>
> >> I spent time In an informatica shop where all the jobs were numbered.
> >> Numbered.  Let’s be more exact... their NAMES were NUMBERS like 56709.
> >> Terrible, but 100% worked, because while a descriptive name would have
> been
> >> useful, the name is the thing that’s supposed to NOT CHANGE (see code of
> >> Abibarshim), and all the other information can attach to that in places
> >> where you write... other information.  People would curse a number
> “F’ing
> >> 6291 failed again” - everyone knew what they were talking about.. I
> digress.
> >>
> >> You might decide to document “dag ID 12” or just “12” on your wiki - I’m
> >> going to document “daily_sales_import”.  And when things start failing
> at
> >> 3am it’s not my dag “56” that’s failing, it’s the sales_export dag.
> But if
> >> you document “12”, that’s still it’s name, and it’d better be 12 in all
> >> your environments and documents.  This also means the actual db IDs from
> >> your proposal are almost certainly NOT the same across your
> environments,
> >> making the 12 unchangeable name!
> >>
> >> There are lots of languages (most of them) where the name of a thing is
> >> important and hard to change.  It’s not a bad thing, and I’d assume that
> >> deploying a thing by name has some significance in many systems.  Go
> rename
> >> a class in... pick a language... tell me how that should be easier to do
> >> willy-nilly so it’s easier In the UI.
> >>
> >> I suppose you could view it as a limitation, But i don’t think you’ve
> >> illuminated a single use case where it’s an actual technical constraint
> or
> >> limitation.
> >>
> >> The BEST argument against the current implementation is db performance.
> >> It’s a hogwash argument.  Basic key indexes on low cardinality string
> >> columns are plenty fast for the airflow workload, and if your task load
> is
> >> so high airflow can’t keep up or your seeing super-fast tasks and
> airflow
> >> db/tracking latency is too much... perhaps a messaging or queue
> processing
> >> solution is better suited to those workloads.  We see scheduler
> bottlenecks
> >> long before the database for our “quick task” scenarios.  Additionally,
> >> reading through this list you’ll find people running airflow at
> substantial
> >> scale - I’ve not seen anyone complaining of production performance
> issues
> >> based on this design decision.   At first I hated it.  String keys are
> >> dirty, we’re all taught that as good little programmers.  Except when
> >> performance won’t be a huge consideration since it’s not OLTP and easy
> of
> >> queryabilty is more important because it’s a growing system... good
> >> decision - whoever made it.
> >>
> >> How does filename matter?  Frankly I wish the filename was REQUIRED to
> be
> >> the dag name so people would quit confusing themselves by mismatching
> them
> >> !   We’ve renamed dag files with no issue as long as the content doesn’t
> >> change, so again, not a real use case.  And really - name your stuff
> >> careful before you get to prod man.
> >>
> >> I gotta ask - airflowuser - are you gonna use airflow for anything, or
> >> just poke it with a stick from a distance and ask semi-inane questions
> of
> >> these fine folks that wrote and spend time working on this cool piece of
> >> kit?
> >>
> >> B
> >>
> >> Sent from a device with less than stellar autocorrect
> >>
> >>> On Sep 20, 2018, at 3:12 PM, Driesprong, Fokko <fokko@xxxxxxxxxxxxxx>
> >> wrote:
> >>>
> >>> I like the dag_id for both the name and as an unique identifier. If you
> >>> change the dag in such a way, that it deserves a new name, you probably
> >>> want to create a new dag anyway. If you want to give some additional
> >>> context, you can use the description field:
> >>>
> >>
> https://github.com/apache/incubator-airflow/blob/master/airflow/models.py#L3131-L3132
> >>>
> >>> The name of the file of dag does not have any influence.
> >>>
> >>> My 2¢
> >>>
> >>> Cheers, Fokko
> >>>
> >>> Op do 20 sep. 2018 om 19:40 schreef James Meickle
> >>> <jmeickle@xxxxxxxxxxxxxx.invalid>:
> >>>
> >>>> I'm personally against having some kind of auto-increment numeric ID
> for
> >>>> DAGs. While this makes a lot of sense for systems where creation is a
> >>>> database activity (like a POST request), in Airflow, DAG creation is
> >>>> actually a code ship activity. There are all kinds of complex
> scenarios
> >>>> around that:
> >>>>
> >>>> - I revert a commit and a DAG disappears or is renamed
> >>>> - I run the same file, twice, with multiple parameters to create two
> >> DAGs
> >>>> - I create the DAG in both staging and prod, but they wind up with
> >>>> different IDs
> >>>>
> >>>> It's just too hard to automatically track these scenarios.
> >>>>
> >>>> If we really wanted to put something like this in place, it would
> first
> >>>> make more sense to decouple DAG creation from code shipping, and
> instead
> >>>> prefer creation of a DAG outside of code (but with a definition that
> >>>> references which git repo/committish/file/arguments/etc. to use). Then
> >> if
> >>>> you do something like rename a file, the DAG breaks, but at least
> still
> >>>> exists in the db with that ID and history still makes sense once you
> >> update
> >>>> the DAG definition with the new code location.
> >>>>
> >>>> On Thu, Sep 20, 2018 at 4:52 AM airflowuser
> >>>> <airflowuser@xxxxxxxxxxxxxx.invalid> wrote:
> >>>>
> >>>>> Hi,
> >>>>> though this could have been explained on Jira I think this should be
> >>>>> discussed first.
> >>>>>
> >>>>> The problem:
> >>>>> Airflow mixes DAG name with id. It uses same filed for both purposes.
> >>>>>
> >>>>> I assume that most of you use the dag_id to describe what the DAG
> >>>> actually
> >>>>> does.
> >>>>> For example:
> >>>>>
> >>>>> dag = DAG(
> >>>>>   dag_id='cost_report_daily',
> >>>>> ...
> >>>>> )
> >>>>>
> >>>>> This dag_id is reflected to the dag id column in the UI.
> >>>>> Now, lets say that you want to add another task to this specific dag
> -
> >>>> You
> >>>>> are to be extremely careful when you change the dag_id to represent
> the
> >>>> new
> >>>>> functionality for example : dag_id='cost_expenses_reports_daily' .
> This
> >>>>> will break the history of the DAG.
> >>>>>
> >>>>> Or even with simpler use case.. the user just want to change the name
> >> he
> >>>>> sees on the UI.
> >>>>>
> >>>>> I suggest to have a discussion if the dag_id should be split into id
> >> (an
> >>>>> actual id) and name to reflect what it does. When the "connection" is
> >>>> done
> >>>>> by id's  - names can change as much as you want without breaking
> >>>> anything.
> >>>>> essentially it becomes a field uses for display purpose  only.
> >>>>>
> >>>>> * I didn't mention also the issue of DAG file name which can also
> cause
> >>>>> trouble if someone wants to change it.
> >>>>>
> >>>>> Sent with [ProtonMail](https://protonmail.com) Secure Email.
> >>>>
> >>
>
>