Re: Fundamental change - Separate DAG name and id.
Re: [Brian Greene] "How does filename matter? Frankly I wish the filename
was REQUIRED to be the dag name so people would quit confusing themselves
by mismatching them !"
FWIW in the Facebook predecessor to Airflow, the file path/name WAS the dag
name. E.g. if your dag resided in best_team/new_project/sweet_dag.py then
the dag name would be best_team.new_project.sweet_dag
All tasks were identified by their variable name after that prefix: E.g. if
best_team.new_project.sweet_dag defines an operator in a variable named
task1, then the respective task_id is best_team.new_project.sweet_dag.task1.
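
A minimal sketch of that naming scheme, for illustration only. The helper names dag_name_from_path and task_id_for are hypothetical (they are not from the Facebook tool or from Airflow), and it assumes POSIX-style relative paths:

```python
from pathlib import PurePosixPath


def dag_name_from_path(path: str) -> str:
    """Derive a dotted DAG name from a relative file path.

    Hypothetical helper: drops the file extension and joins the
    remaining path components with dots.
    """
    parts = PurePosixPath(path).with_suffix("").parts
    return ".".join(parts)


def task_id_for(dag_name: str, variable_name: str) -> str:
    """Prefix the operator's variable name with the DAG name."""
    return f"{dag_name}.{variable_name}"


dag_name = dag_name_from_path("best_team/new_project/sweet_dag.py")
task_id = task_id_for(dag_name, "task1")
```

With the path from the example above, dag_name comes out as best_team.new_project.sweet_dag and task_id as best_team.new_project.sweet_dag.task1, which shows how quickly the identifiers get long.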
Airflow provides additional flexibility to specify DAG and task names. This
avoids the sometimes annoyingly long task names that scheme produced, and it
allows DAG/task names that aren't tied to a code directory structure or to
Python's variable naming restrictions. I think this is a Good Thing.
It seems like airflowuser is trying to provide additional metadata beyond
the DAG/task names (so far, a DAG 'title' distinct from the ID). I've
provided this through a README.md included in the DAG source directory, but
maybe it would be a win to instead add a DAG parameter named 'readme' of
string type which can include a docstring or even markdown to provide any
desired additional metadata? This could then be displayed by the UI to
simplify access to any such provided DAG documentation.
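
To make the idea concrete, here is a rough sketch in plain Python. It does not use Airflow's API: DagDocs, its readme field, and ui_summary are all hypothetical names, and the "first non-empty line is the title" rule is just one assumption about how a UI might present the text:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DagDocs:
    """Hypothetical container: a stable dag_id plus free-form docs."""
    dag_id: str       # the identifier; never changes
    readme: str = ""  # docstring or markdown; free to change


def ui_summary(docs: DagDocs) -> str:
    """Return a display title for the UI.

    Assumption: the first non-empty readme line, with any leading
    markdown '#' markers stripped, serves as the title. Falls back
    to the dag_id when no readme text is provided.
    """
    for line in docs.readme.splitlines():
        text = line.strip().lstrip("#").strip()
        if text:
            return text
    return docs.dag_id


docs = DagDocs("cost_report_daily", "# Daily cost report\n\nOwned by finance.")
title = ui_summary(docs)
```

The point of the sketch is that the readme can be edited freely while dag_id stays fixed, so history keyed on the ID is unaffected.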
On Thu, Sep 20, 2018 at 10:45 PM Brian Greene <
> Prior to using airflow for much, on first inspection, I think I may have
> agreed with you.
> After a bit of use I’d agree with Fokko and others - this isn’t really a
> problem, and separating them seems to do more harm than good related to
> I was gonna stop there, but why?
> You can add a task to a dag that’s deployed and has run and still view
> history. The “new” task shows up as white squares in the old DAG runs. Nobody
> said you’re required to also rename the dag when you do this. If your
> process or desire or design determines you need to rename it, well then by
> definition... isn’t it a new thing without a history? Airflow is
> implementing exactly that.
> One could argue that renaming to reflect exact purpose is good practice.
> Yes, I’d agree, but again following that logic if it’s a small enough
> change to “slip in” then the name likely shouldn’t change. If it’s big
> enough I want to change the name then it’s a big enough change that I’m
> functionally running something “new”, and I expect to need to account for
> that. Airflow is enforcing that logic by coupling the name to the
> deployment of what you said was a new process.
> One might put forth that changing the name to be more descriptive in the
> UI makes it easier for support staff. I think perhaps if that’s your
> challenge it’s not airflow that’s a problem. Dags are of course documented
> elsewhere besides their name, right? Yeah it’s self documenting (and the
> graphs are cool), but I have to assume there’s something besides the NAME
> to tell people what it does. Additionally, far more than the name is
> required for even an operator or monitor watcher to take action - you don’t
> expect them to know which tasks to rerun or how to troubleshoot failures
> just based on your “now most descriptive name in the UI” do you?
> I spent time in an Informatica shop where all the jobs were numbered.
> Numbered. Let’s be more exact... their NAMES were NUMBERS like 56709.
> Terrible, but 100% worked, because while a descriptive name would have been
> useful, the name is the thing that’s supposed to NOT CHANGE (see code of
> Abibarshim), and all the other information can attach to that in places
> where you write... other information. People would curse a number “F’ing
> 6291 failed again” - everyone knew what they were talking about.. I digress.
> You might decide to document “dag ID 12” or just “12” on your wiki - I’m
> going to document “daily_sales_import”. And when things start failing at
> 3am it’s not my dag “56” that’s failing, it’s the sales_export dag. But if
> you document “12”, that’s still its name, and it’d better be 12 in all
> your environments and documents. This also means the actual db IDs from
> your proposal are almost certainly NOT the same across your environments,
> making the 12 unchangeable name!
> There are lots of languages (most of them) where the name of a thing is
> important and hard to change. It’s not a bad thing, and I’d assume that
> deploying a thing by name has some significance in many systems. Go rename
> a class in... pick a language... tell me how that should be easier to do
> willy-nilly so it’s easier in the UI.
> I suppose you could view it as a limitation, but I don’t think you’ve
> illuminated a single use case where it’s an actual technical constraint or
> The BEST argument against the current implementation is db performance.
> It’s a hogwash argument. Basic key indexes on low cardinality string
> columns are plenty fast for the airflow workload, and if your task load is
> so high airflow can’t keep up or you’re seeing super-fast tasks and airflow
> db/tracking latency is too much... perhaps a messaging or queue processing
> solution is better suited to those workloads. We see scheduler bottlenecks
> long before the database for our “quick task” scenarios. Additionally,
> reading through this list you’ll find people running airflow at substantial
> scale - I’ve not seen anyone complaining of production performance issues
> based on this design decision. At first I hated it. String keys are
> dirty, we’re all taught that as good little programmers. Except when
> performance won’t be a huge consideration since it’s not OLTP and ease of
> queryability is more important because it’s a growing system... good
> decision - whoever made it.
> How does filename matter? Frankly I wish the filename was REQUIRED to be
> the dag name so people would quit confusing themselves by mismatching them
> ! We’ve renamed dag files with no issue as long as the content doesn’t
> change, so again, not a real use case. And really - name your stuff
> careful before you get to prod man.
> I gotta ask - airflowuser - are you gonna use airflow for anything, or
> just poke it with a stick from a distance and ask semi-inane questions of
> these fine folks that wrote and spend time working on this cool piece of
> Sent from a device with less than stellar autocorrect
> > On Sep 20, 2018, at 3:12 PM, Driesprong, Fokko <fokko@xxxxxxxxxxxxxx>
> > I like the dag_id for both the name and as a unique identifier. If you
> > change the dag in such a way, that it deserves a new name, you probably
> > want to create a new dag anyway. If you want to give some additional
> > context, you can use the description field:
> > The name of the DAG file does not have any influence.
> > My 2¢
> > Cheers, Fokko
> > Op do 20 sep. 2018 om 19:40 schreef James Meickle
> > <firstname.lastname@example.org>:
> >> I'm personally against having some kind of auto-increment numeric ID for
> >> DAGs. While this makes a lot of sense for systems where creation is a
> >> database activity (like a POST request), in Airflow, DAG creation is
> >> actually a code ship activity. There are all kinds of complex scenarios
> >> around that:
> >> - I revert a commit and a DAG disappears or is renamed
> >> - I run the same file, twice, with multiple parameters to create two
> >> - I create the DAG in both staging and prod, but they wind up with
> >> different IDs
> >> It's just too hard to automatically track these scenarios.
> >> If we really wanted to put something like this in place, it would first
> >> make more sense to decouple DAG creation from code shipping, and instead
> >> prefer creation of a DAG outside of code (but with a definition that
> >> references which git repo/committish/file/arguments/etc. to use). Then if
> >> you do something like rename a file, the DAG breaks, but at least still
> >> exists in the db with that ID and history still makes sense once you update
> >> the DAG definition with the new code location.
> >> On Thu, Sep 20, 2018 at 4:52 AM airflowuser
> >> <email@example.com> wrote:
> >>> Hi,
> >>> though this could have been explained on Jira I think this should be
> >>> discussed first.
> >>> The problem:
> >>> Airflow mixes DAG name with id. It uses the same field for both purposes.
> >>> I assume that most of you use the dag_id to describe what the DAG actually
> >>> does.
> >>> For example:
> >>> dag = DAG(
> >>>     dag_id='cost_report_daily',
> >>>     ...
> >>> )
> >>> This dag_id is shown in the DAG id column in the UI.
> >>> Now, let's say that you want to add another task to this specific dag. You
> >>> have to be extremely careful when you change the dag_id to represent the new
> >>> functionality, for example: dag_id='cost_expenses_reports_daily'. This
> >>> will break the history of the DAG.
> >>> Or even with a simpler use case: the user just wants to change the name
> >>> they see in the UI.
> >>> I suggest we discuss whether the dag_id should be split into an id (the
> >>> actual id) and a name to reflect what it does. When the "connection" is done
> >>> by ids, names can change as much as you want without breaking anything.
> >>> The name essentially becomes a field used for display purposes only.
> >>> * I didn't mention also the issue of DAG file name which can also cause
> >>> trouble if someone wants to change it.