Re: Fundamental change - Separate DAG name and id.


I like using the dag_id as both the name and a unique identifier. If you
change the DAG so much that it deserves a new name, you probably want to
create a new DAG anyway. If you want to give some additional context, you
can use the description field:
https://github.com/apache/incubator-airflow/blob/master/airflow/models.py#L3131-L3132

The file name of the DAG does not have any influence.
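
To illustrate, here is a minimal sketch in plain Python. It is not Airflow's
actual implementation: it only assumes that run history is stored keyed on
dag_id, and the names DagRecord, HistoryStore, record_run and runs_for are
hypothetical. Under that assumption, editing the description keeps the
history, while changing the dag_id effectively starts a new DAG.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class DagRecord:
    """Hypothetical stand-in for a DAG definition (not the Airflow class)."""
    dag_id: str
    description: str = ""


class HistoryStore:
    """Hypothetical run history, keyed on dag_id (an assumption)."""

    def __init__(self):
        self._runs = defaultdict(list)

    def record_run(self, dag, run_date):
        self._runs[dag.dag_id].append(run_date)

    def runs_for(self, dag):
        return list(self._runs[dag.dag_id])


store = HistoryStore()
dag = DagRecord(dag_id="cost_report_daily", description="Daily cost report")
store.record_run(dag, "2018-09-19")
store.record_run(dag, "2018-09-20")

# Changing only the description keeps the same key, so history is intact.
dag.description = "Daily cost and expenses report"
runs_after_description_change = store.runs_for(dag)

# Changing the dag_id produces a different key: the old runs are still
# stored, but the renamed DAG no longer finds them.
renamed = DagRecord(dag_id="cost_expenses_reports_daily",
                    description=dag.description)
runs_after_rename = store.runs_for(renamed)
```

In this sketch the first lookup still returns both runs and the second
returns an empty list, which is why a rename behaves like creating a new DAG.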

My 2¢

Cheers, Fokko

On Thu, 20 Sep 2018 at 19:40, James Meickle
<jmeickle@xxxxxxxxxxxxxx.invalid> wrote:

> I'm personally against having some kind of auto-increment numeric ID for
> DAGs. While this makes a lot of sense for systems where creation is a
> database activity (like a POST request), in Airflow, DAG creation is
> actually a code ship activity. There are all kinds of complex scenarios
> around that:
>
> - I revert a commit and a DAG disappears or is renamed
> - I run the same file, twice, with multiple parameters to create two DAGs
> - I create the DAG in both staging and prod, but they wind up with
> different IDs
>
> It's just too hard to automatically track these scenarios.
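
A small sketch of the staging/prod scenario, in plain Python with
hypothetical names (AutoIncrementRegistry, dag_names_from_file, and the
"cost_report_eu"/"cost_report_us" DAGs are made up for illustration). It
assumes each environment hands out numeric IDs in the order it first sees a
DAG, which is one plausible way an auto-increment scheme could work:

```python
import itertools


class AutoIncrementRegistry:
    """Hypothetical per-environment registry handing out numeric IDs."""

    def __init__(self):
        self._counter = itertools.count(1)
        self.ids = {}

    def register(self, name):
        # Assign the next number only the first time a name is seen.
        if name not in self.ids:
            self.ids[name] = next(self._counter)
        return self.ids[name]


def dag_names_from_file(regions):
    """One file run with several parameters yields several DAGs."""
    return ["cost_report_{}".format(region) for region in regions]


staging = AutoIncrementRegistry()
prod = AutoIncrementRegistry()

# Staging received an experimental DAG first; prod never did.
staging.register("experiment")
for name in dag_names_from_file(["eu", "us"]):
    staging.register(name)
    prod.register(name)

# Same code, same DAG, but the numeric ID depends on the environment.
staging_id = staging.ids["cost_report_eu"]
prod_id = prod.ids["cost_report_eu"]
ids_diverge = staging_id != prod_id
```

The string name is identical in both environments because it comes from the
code, while the numeric ID depends on each environment's registration order.
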
>
> If we really wanted to put something like this in place, it would first
> make more sense to decouple DAG creation from code shipping, and instead
> prefer creation of a DAG outside of code (but with a definition that
> references which git repo/committish/file/arguments/etc. to use). Then if
> you do something like rename a file, the DAG breaks, but at least still
> exists in the db with that ID and history still makes sense once you update
> the DAG definition with the new code location.
>
> On Thu, Sep 20, 2018 at 4:52 AM airflowuser
> <airflowuser@xxxxxxxxxxxxxx.invalid> wrote:
>
> > Hi,
> > Though this could have been explained on Jira, I think it should be
> > discussed first.
> >
> > The problem:
> > Airflow mixes the DAG name with its id. It uses the same field for both
> > purposes.
> >
> > I assume that most of you use the dag_id to describe what the DAG
> > actually does.
> > For example:
> >
> > dag = DAG(
> >     dag_id='cost_report_daily',
> > ...
> > )
> >
> > This dag_id is shown in the DAG id column in the UI.
> > Now, let's say that you want to add another task to this specific DAG.
> > You have to be extremely careful if you change the dag_id to reflect the
> > new functionality, for example dag_id='cost_expenses_reports_daily'.
> > This will break the history of the DAG.
> >
> > Or take an even simpler use case: the user just wants to change the name
> > shown in the UI.
> >
> > I suggest we discuss whether the dag_id should be split into an id (an
> > actual identifier) and a name that reflects what the DAG does. When the
> > "connection" is made by ids, names can change as much as you want without
> > breaking anything. The name essentially becomes a field used for display
> > purposes only.
> >
> > * I also didn't mention the issue of the DAG file name, which can cause
> > trouble if someone wants to change it.