Fwd: Class-based DAG Syntactic Sugar


---------- Forwarded message ---------
From: Max Goodridge <work@xxxxxxxxxxxxxxxx>
Date: Fri, Oct 26, 2018 at 9:41 PM
Subject: Class-based DAG Syntactic Sugar
To: <dev-subscribe@xxxxxxxxxxxxxxxxxxxxxxxxxxxx>


Hello team,

I would like to make a proposal that some people have said would make sense
to merge upstream. If you like Django, it'll feel very familiar to you.

The following is copied from Slack - original here:
https://apache-airflow.slack.com/archives/CCY359SCV/p1538585182000100

We use Airflow to write a lot of DAGs, and coming from a Django background
I found it frustrating that I had to repeat myself and write DAGs in a way
that, in my opinion, could be more Pythonic. Two examples: specifying
dependencies by hand (90% of DAG operators have the same dependency
chain), and having to manually create a DAG instead of just defining a class
that does it for us. We currently use an abstraction layer on top of
Airflow. It's called “workflows”: essentially class-based DAGs. Let me
illustrate with an example of what our DAGs look like:
```
# This would create a DAG called `example_workflow` with two operators,
# the second dependent on the first, and explicit DAG metadata (a
# schedule in this case).

class ExampleWorkflow(workflows.Workflow):
    class Meta:
        schedule_interval = '0 9 * * *'

    do_something_useful = workflows.PythonOperator(
        python_callable=python_callable,
    )
    something_else = workflows.PythonOperator(
        python_callable=python_callable,
    )
```
We also currently have an extra line to work around Airflow’s use of
globals for DAG collection, but that would disappear nicely if we choose to
merge this abstraction upstream. I thought about doing some hacky things or
maintaining a custom fork, but that was decided against for now.

Key points:
• Class attributes are the default `task_id` for associated operators,
otherwise the operators are the same (though `task_id` can be specified as
normal for easy backwards compatibility, and easier migration of old DAGs
to new syntax).
• The Django-inspired `Meta` class sets DAG information, including any
arg/kwarg that you’d normally specify directly in the `DAG` class
construction (could be anything else too though).
• A *default* (inherited) dependency structure that can be overridden by
overriding the relevant class method (our signature: `def dependencies(cls,
operators):`).
• It's *class-based*, which importantly means we can use inheritance to
eliminate repeated operators and metadata (e.g. post to Slack, Datadog,
etc.).
• Using DAG metadata that can be inherited (assuming it exists somewhere
in the MRO), we can write similar DAGs very simply, for example by
inheriting the schedule interval shared by a particular domain.
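
To make the key points above concrete, here is a rough sketch of how a
metaclass could collect operators and `Meta` options. It is an illustration
under stated assumptions, not our actual implementation: `Operator`,
`WorkflowMeta`, `NotifyingWorkflow`, and the `operators`/`options`/`edges`
attributes are hypothetical stand-ins, nothing is imported from Airflow, and
the "linear chain in definition order" default is an assumption.

```python
import re


class Operator:
    """Hypothetical stand-in for an operator; not the Airflow class."""

    def __init__(self, task_id=None, **kwargs):
        self.task_id = task_id
        self.kwargs = kwargs


class WorkflowMeta(type):
    """Collects operators and Meta options when a workflow class is created."""

    def __new__(mcs, name, bases, namespace):
        cls = super().__new__(mcs, name, bases, namespace)
        operators, options = {}, {}
        # Walk the MRO from base to subclass so subclasses override parents.
        for klass in reversed(cls.__mro__):
            for attr, value in vars(klass).items():
                if isinstance(value, Operator):
                    operators[attr] = value
            meta = vars(klass).get("Meta")
            if meta is not None:
                options.update(
                    {k: v for k, v in vars(meta).items() if not k.startswith("_")}
                )
        # The attribute name is the default task_id; an explicit one wins.
        for attr, op in operators.items():
            if op.task_id is None:
                op.task_id = attr
        # ExampleWorkflow -> example_workflow
        cls.dag_id = re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()
        cls.operators = operators
        cls.options = options
        cls.edges = cls.dependencies(list(operators.values()))
        return cls


class Workflow(metaclass=WorkflowMeta):
    @classmethod
    def dependencies(cls, operators):
        # Assumed default: each operator depends on the one defined before it.
        return [(a.task_id, b.task_id) for a, b in zip(operators, operators[1:])]


class NotifyingWorkflow(Workflow):
    """Shared metadata and a shared operator, inherited by subclasses."""

    class Meta:
        schedule_interval = "0 9 * * *"

    notify_slack = Operator()


class ExampleWorkflow(NotifyingWorkflow):
    do_something_useful = Operator()
    something_else = Operator(task_id="custom_id")  # explicit task_id still works


class FanOutWorkflow(Workflow):
    extract = Operator()
    load_a = Operator()
    load_b = Operator()

    @classmethod
    def dependencies(cls, operators):
        # Override the default chain: everything depends on the first operator.
        first, *rest = operators
        return [(first.task_id, op.task_id) for op in rest]


print(ExampleWorkflow.dag_id, ExampleWorkflow.options)
print(ExampleWorkflow.edges)
print(FanOutWorkflow.edges)
```

In this sketch the edges are plain `(upstream, downstream)` pairs of task IDs.
A real version would instead build a `DAG` from the collected options and wire
up the actual operators.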

Any questions welcome. I would be happy to make the necessary changes.
--- End of Slack Message ---

Thank you to Kaxil Naik for the feedback so far and the advice to post here
to gauge interest in this abstraction.

Thanks,
Max