Re: Guidelines on Contrib vs Non-contrib


I am working on adding a GCP Cloud Functions operator - https://issues.apache.org/jira/browse/AIRFLOW-2912 (and soon more GCP-related ones, like GCE and CloudSQL). For now I am adding those operators to contrib (I will prepare a PR soon).

I think it would indeed make sense to have those operators separated into their own projects. It would make merging/rebasing etc. a bit easier, at the expense of explicit dependency management: I imagine those external modules will depend on certain versions of Airflow (maybe "> x.y.z"), since the operators need to use some of the objects/classes provided by Airflow (for example LoggingMixin).
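
To illustrate the kind of version dependency I mean, here is a minimal sketch of how an external operator package might check that the installed Airflow is new enough. Everything in it is an assumption for illustration: MIN_AIRFLOW, parse_version and is_supported are hypothetical names, "1.10.0" merely stands in for "x.y.z", and I use ">=" rather than a strict ">". In practice the same constraint would more likely be declared in the package metadata (for example install_requires in setup.py) than checked by hand.

```python
# Hypothetical sketch: none of these names come from Airflow itself.
import re

MIN_AIRFLOW = "1.10.0"  # assumed minimum version, stands in for "x.y.z"


def parse_version(text):
    """Turn '1.10.0' or '2.0.0rc1' into a tuple of three ints."""
    parts = []
    for piece in text.split(".")[:3]:
        # Keep only the leading digits, so '0rc1' becomes 0.
        match = re.match(r"\d+", piece)
        parts.append(int(match.group()) if match else 0)
    # Pad short versions such as '1.10' to three components.
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def is_supported(installed, minimum=MIN_AIRFLOW):
    """Return True if the installed version is at least the minimum."""
    return parse_version(installed) >= parse_version(minimum)
```

The comparison is done on integer tuples rather than strings because a plain string comparison would wrongly rank "1.9.0" above "1.10.0". This sketch ignores pre-release ordering, so a release candidate is treated the same as its final release.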

I would be happy (sooner or later) to make such a move, but I am pretty new to Airflow and I am afraid I do not understand all the consequences just yet. A few things have already come to mind, though. Maybe someone can shed some light on them:

* the dependency management I mentioned above (I am not sure what versioning scheme Airflow uses)

* whether those new projects/repos will also be part of the Apache Incubator, or whether they should be completely independent and managed by the organisations/individuals that create them. How should we deal with responsibilities (also discussed in the separate thread on JIRA vs. GH issues linked earlier: https://lists.apache.org/thread.html/10be0c50a4aecdde66b1593cc30f0b0246035eb0b3281ee92744f783@%3Cdev.airflow.apache.org%3E)? Currently all the code in airflow-incubator is community-owned (as discussed in the JIRA thread). I am not sure what this would look like for separate projects, how to manage contributors there, etc. I think it should be super easy to create and maintain such a repo by following a simple guide, with very limited extra overhead; otherwise it will be a pain for contributors to create and maintain separate repos.

* should JIRA issues in the Airflow JIRA also cover the new projects? (I am in favour of sticking to JIRA, BTW. It's much more powerful than GH issues, and as long as there are some rules everyone follows and integration plugins are configured, it can be much better. Some governance is indeed needed, though: I agree that JIRA issues in Airflow are not really managed/manageable currently, especially by new contributors.)

J

On 2018/09/18 18:01:55, James Meickle <jmeickle@xxxxxxxxxxxxxx.INVALID> wrote: 
> So in favor of just using Python modules for operators. I initially wrote
> mine as Airflow plugin compatible, and eventually had to un-write them that
> way, so it's really a new-user trap.
> 
> I've had at least a half dozen times installing/testing/operating Airflow
> where we had some issue based on an integration for a service we've never
> even used (like Hive). I would love to see all of that go away. However, we
> should make sure that it's not too onerous to get a fairly fully featured
> Airflow install, such as having a way for external repos/packages to even
> be discoverable.
> 
> On Tue, Sep 18, 2018 at 1:28 PM Driesprong, Fokko <fokko@xxxxxxxxxxxxxx>
> wrote:
> 
> > I fully agree with using plain Python modules :)
> >
> > I don't think a lot of hooks/operators graduate to core since it will break
> > the import. A few of them, for example Databricks and the Google hooks are
> > mature enough. For me the main point is having test coverage and a stable
> > API.
> >
> > Cheers, Fokko
> >
> > Op di 18 sep. 2018 om 18:30 schreef Victor Noagbodji <
> > vnoagbodji@xxxxxxxxxxxxxxxxxxxxx>:
> >
> > > yes, please!
> > >
> > > > On Sep 18, 2018, at 12:23 PM, Maxime Beauchemin <
> > > maximebeauchemin@xxxxxxxxx> wrote:
> > > >
> > > > +1 for deprecating operators/hooks as plugins, let's use Python's good
> > > old
> > > > python packages and maybe python "entry points" if we want to inject
> > them
> > > > in "airflow.operators"/"airflow.hooks" (which is probably not
> > necessary)
> > > >
> > > > On Tue, Sep 18, 2018 at 2:12 AM Ash Berlin-Taylor <ash@xxxxxxxxxx>
> > > wrote:
> > > >
> > > >> Operators and hooks don't need any special plugin system - simply
> > having
> > > > >> them as separate Python modules which are imported using normal
> > > python
> > > >> semantics is enough.
> > > >>
> > > >> In fact now that I think about it: I want to deprecate the plugins
> > > >> registering hooks/operators etc and limit it to only bits which a
> > simple
> > > >> python import can't manage - which I think is only anything that needs
> > > to
> > > >> be registered with another system, such as custom routes in the web
> > UI.
> > > >>
> > > >> I'll draft an AIP for this soon.
> > > >>
> > > >> -ash
> > > >>
> > > >>
> > > >>> On 18 Sep 2018, at 00:50, George Leslie-Waksman <waksman@xxxxxxxxx>
> > > >> wrote:
> > > >>>
> > > > >>> Given we have a plugin system, could we alternatively move toward
> > > > >>> keeping non-core supported code outside of the core project/repo?
> > > >>>
> > > >>> It would hugely decrease the surface area of the main repository and
> > > >>> testing infrastructure to get most of the contrib code out to its own
> > > >> place.
> > > >>>
> > > >>> Further, it would decrease the committer burden of having to
> > > >> approve/merge
> > > >>> code that is not supposed to be their responsibility.
> > > >>>
> > > >>> On Mon, Sep 17, 2018 at 4:37 PM Tim Swast <swast@xxxxxxxxxx.invalid>
> > > >> wrote:
> > > >>>
> > > >>>>> Individual operators and hooks living in separate repositories on
> > > >> github
> > > >>>> (or possibly other Apache projects), which are then distributed by
> > pip
> > > >> and
> > > >>>> installed as libraries seems like it would scale better.
> > > >>>>
> > > >>>> Pandas did this about a year ago, and it's seemed to have worked
> > well.
> > > >> For
> > > >>>> example, pandas.read_gbq is a very thin wrapper around
> > > >> pandas_gbq.read_gbq
> > > >>>> (distributed as a separate package). It has made it easier for me to
> > > >> track
> > > >>>> issues corresponding to my area of expertise.
> > > >>>>
> > > >>>> On Sun, Sep 16, 2018 at 1:25 PM Jakob Homan <jghoman@xxxxxxxxx>
> > > wrote:
> > > >>>>
> > > >>>>>> My understanding as a contributor is that if a hook/operator is in
> > > >>>> core,
> > > >>>>> it
> > > >>>>>> means that a committer is willing to take personal responsibility
> > to
> > > >>>>>> maintain it (or at least help maintain it), and everything else
> > goes
> > > >> in
> > > >>>>>> contrib.
> > > >>>>>
> > > >>>>> That's not correct.  All of the code is owned by the entire
> > > >>>>> community[1]; no one person is responsible for any of it.  There's
> > no
> > > >>>>> silos, fiefdoms, walled gardens, etc.  If the community cannot
> > > support
> > > >>>>> a piece of code it should be deprecated and subsequently removed.
> > > >>>>>
> > > >>>>> Contrib sections are almost always problematic for this reason.
> > > >>>>> Hadoop ended up abandoning its.  Because Airflow acts as a
> > gathering
> > > >>>>> point for so many disparate technologies (databases, storage
> > systems,
> > > >>>>> compute engines, etc.), trying to keep all of them corralled and up
> > > to
> > > >>>>> date will be very difficult.  Individual operators and hooks living
> > > in
> > > >>>>> separate repositories on github (or possibly other Apache
> > projects),
> > > >>>>> which are then distributed by pip and installed as libraries seems
> > > >>>>> like it would scale better.
> > > >>>>>
> > > >>>>> -Jakob
> > > >>>>>
> > > >>>>> [1]
> > > >> https://blogs.apache.org/foundation/entry/success-at-apache-a-newbie
> > > >>>>>
> > > >>>>> On 15 September 2018 at 13:29, Jeff Payne <jpayne@xxxxxxxxxxx>
> > > wrote:
> > > >>>>>> How many operators are added to contrib per month? Is it too many
> > to
> > > >>>>> make the decision case by case? If so, then the above mentioned
> > rule
> > > >>>> sounds
> > > >>>>> fairly reasonable. However, if that's the rule, shouldn't a bunch
> > of
> > > >>>>> existing modules be moved from contrib to core?
> > > >>>>>>
> > > >>>>>> ________________________________
> > > >>>>>> From: Taylor Edmiston <tedmiston@xxxxxxxxx>
> > > >>>>>> Sent: Saturday, September 15, 2018 1:13:47 PM
> > > >>>>>> To: dev@xxxxxxxxxxxxxxxxxxxxxxxxxxxx
> > > >>>>>> Subject: Re: Guidelines on Contrib vs Non-contrib
> > > >>>>>>
> > > >>>>>> My understanding as a contributor is that if a hook/operator is in
> > > >>>> core,
> > > >>>>> it
> > > >>>>>> means that a committer is willing to take personal responsibility
> > to
> > > >>>>>> maintain it (or at least help maintain it), and everything else
> > goes
> > > >> in
> > > >>>>>> contrib.
> > > >>>>>>
> > > >>>>>> *Taylor Edmiston*
> > > >>>>>> Blog <https://blog.tedmiston.com/> | LinkedIn
> > > >>>>>> <https://www.linkedin.com/in/tedmiston/> | Stack Overflow
> > > >>>>>> <https://stackoverflow.com/users/149428/taylor-edmiston> |
> > > Developer
> > > >>>>> Story
> > > >>>>>> <https://stackoverflow.com/story/taylor>
> > > >>>>>>
> > > >>>>>>
> > > >>>>>>
> > > >>>>>> On Sat, Sep 15, 2018 at 2:02 PM Kaxil Naik <kaxilnaik@xxxxxxxxx>
> > > >>>> wrote:
> > > >>>>>>
> > > >>>>>>> Hi, all (mainly contributors),
> > > >>>>>>>
> > > >>>>>>> Can we decide on a common guideline on when a hook/operator
> > should
> > > go
> > > >>>>> under
> > > >>>>>>> contrib vs core?
> > > >>>>>>>
> > > >>>>>>> Regards,
> > > >>>>>>>
> > > >>>>>>> *Kaxil Naik*
> > > >>>>>>> *Big Data Consultant *@ *Data Reply UK*
> > > >>>>>>> *Certified *Google Cloud Data Engineer | *Certified* Apache
> > Spark &
> > > >>>>> Neo4j
> > > >>>>>>> Developer
> > > >>>>>>> *Phone: *+44 (0) 74820 88992
> > > >>>>>>> *LinkedIn*: https://www.linkedin.com/in/kaxil
> > > >>>>>>>
> > > >>>>>
> > > >>>> --
> > > >>>> *  •  **Tim Swast*
> > > >>>> *  •  *Software Friendliness Engineer
> > > >>>> *  •  *Google Cloud Developer Relations
> > > >>>> *  •  *Seattle, WA, USA
> > > >>>>
> > > >>
> > > >>
> > >
> > >
> >
>