git.net

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Lineage


^this

On Sat, May 5, 2018, 15:37 Marcin Szymański <ms32035@xxxxxxxxx> wrote:

> Hi Bolke
>
> Great stuff. Pieces of this this remind work I have done for one
> organization. However in that case, instead of defining base classes like
> Dataset form scratch, I extended objects from SQLAlchemy, such as Metadata,
> Table, etc. This in turn allowed for automated SQL generation (with some
> changes to operators), defining data quality checks and many other cool
> things. Maybe it's worth going down that path? I am willing to share more
> details, if interested.
>
> Best
> Marcin
>
> On Sat, May 5, 2018, 22:49 Bolke de Bruin <bdbruin@xxxxxxxxx> wrote:
>
> > Hi All,
> >
> > I have made a first implementation that allows tracking of lineage in
> > Airflow and integration with Apache Atlas. It was inspired by Jeremiah’s
> > work in the past on Data Flow pipelines, but I think I kept it a little
> bit
> > simpler.
> >
> > Operators now have two new parameters called “inlets” and “outlets”.
> These
> > can be filled with objects derived from “DataSet”, like “File” and
> > “HadoopFile”. Parameters are jinja2 templated, which
> > means they receive the context of the task when it is running and get
> > rendered. So you can get definitions like this:
> >
> > f_final = File(name="/tmp/final")
> > run_this_last = DummyOperator(task_id='run_this_last', dag=dag,
> >     inlets={"auto": True},
> >     outlets={"datasets": [f_final,]})
> >
> > f_in = File(name="/tmp/whole_directory/")
> > outlets = []
> > for file in FILE_CATEGORIES:
> >     f_out = File(name="/tmp/{}/{{{{ execution_date }}}}".format(file))
> >     outlets.append(f_out)
> > run_this = BashOperator(
> >     task_id='run_after_loop', bash_command='echo 1', dag=dag,
> >     inlets={"auto": False, "task_ids": [], "datasets": [f_in,]},
> >     outlets={"datasets": outlets}
> >     )
> > run_this.set_downstream(run_this_last)
> >
> > So I am trying to keep to boilerplate work down for developers. Operators
> > can also extend inlets and outlets automatically. This will probably be a
> > bit harder for the BashOperator without some special magic, but an update
> > to the DruidOperator can be relatively quite straightforward.
> >
> > In the future Operators can take advantage of the inlet/outlet
> definitions
> > as they are also made available as part of the context for templating (as
> > “inlets” and “outlets”).
> >
> > I’m looking forward to your comments!
> >
> > https://github.com/apache/incubator-airflow/pull/3321
> >
> > Bolke.
>


( ! ) Warning: include(msgfooter.php): failed to open stream: No such file or directory in /var/www/git/apache-airflow-development/msg03265.html on line 135
Call Stack
#TimeMemoryFunctionLocation
10.0007358472{main}( ).../msg03265.html:0

( ! ) Warning: include(): Failed opening 'msgfooter.php' for inclusion (include_path='.:/var/www/git') in /var/www/git/apache-airflow-development/msg03265.html on line 135
Call Stack
#TimeMemoryFunctionLocation
10.0007358472{main}( ).../msg03265.html:0