Re: Recruiting more maintainers for Apache Arrow


hi Marco,

some comments inline

On Sat, Jun 30, 2018 at 2:15 PM, Marco Neumann
<marco@xxxxxxxxxxxxxx.invalid> wrote:
> Hey,
>
> first of all, thanks a lot for your work, Uwe's, and that of the
> mergers and contributors. Now, to the maintainer problem:
>
> # Arrow as "a library"
> One thing that makes Arrow special is that it is not a single library
> but many (one for each language), and many of them are not only a
> binding to a C/C++ lib, but partly a complete re-implementation of the
> protocol, e.g.:
>
> - C++: one core, but also contains Python specialties
> - Java: another core
> - Rust: yet another core
> - Python: a binding to C++ but also a lot more stuff because of Pandas
> ...
>
> And you two are maintaining all of them and I doubt that you have the
> capacities and knowledge to do this at the desired level of quality
> (which is natural, not a personal issue or offense). So this I would
> call "pseudo-maintenance", since you're solely the gatekeeper that does
> some shallow reviewing and has the burden to do the housekeeping and
> the merging. So why accept these language bindings in the first
> place without putting a core maintainer in place? For example, let's
> say someone proposes a binding to Haskell now. That should not be
> accepted as part of the official Apache implementation without a
> dedicated maintainer (ideally the PR author would be that person, but
> there may be others who step up).

Most of the development activity, and where we have the greatest need
for help, is in C++ and Python. The other area is dev/CI infrastructure
and release management.

We're falling behind on implementation and design work involving
Java-land (I have been trying for about a year to hammer down an
improved Interval type), but that's a separate problem.

We are about to reach a point (particularly if Gandiva becomes part of
Apache Arrow) where more languages will become dependent on the C++
library. This makes the need for more C++ maintainers even more
urgent.

I think the other libraries have done a good job of self-managing
their code (e.g. Java, JavaScript), and I frequently merge patches
when there is a +1 or some other consensus.

>
> Right now, it might be too late to remove some of the incomplete / WIP
> implementations that don't have a core maintainer though.

Honestly, the incomplete/WIP projects are not causing any maintenance
burden. It's the main projects and their development lifecycle that is
creating a lot of work.

>
> # GitHub
> Another special thing to consider is that Arrow is (ab)using GitHub as
> a code hosting platform. Even for a contributor, this has some
> obviously uncool consequences:

I think these issues are red herrings. If maintainers are more
motivated by the gamification of their open source contributions
than by the health and success of the project, I really question
how valuable they are as maintainers.

>
> - you have yet another issue hosting system to log in

I strongly dispute the notion that using JIRA is a deterrent to
maintainers. If anything, it's a filter for drive-by contributors and
unserious maintainers. I say this as the project's primary JIRA
gardener.

> - there is yet another information channel to keep track of (this ML
>   for example, which has a semi-informative web interface telling you
>   that you can only log in using Google but not how to subscribe to
>   the list)
> - links to issues don't work in the known magic way

I think these things might deter passers-by, but I don't see why they
would be a problem for someone who is concerned with the health of the
project. As the primary maintainer of the project, these things don't
impact me in any way.

> - you're merging the PRs by closing them, which is not a very nice
>   approach because it does not reflect the contributor's work in
>   the project overview and personal profiles, but exactly this is a
>   large part of the GitHub community (btw: merging PRs without using
>   GitHub's merge button IS possible, as bors/bors-ng prove)

For each patch you contribute, you get one contribution "point" on
GitHub, but it won't show that you have a PR "merged". I don't see why
we should have to comply with GitHub's gamified approach to open
source.

>
> So as a potential maintainer, this is already a bumper, since I know
> that some things are less comfortable than in the system I would get
> from any normal GitHub or GitLab project.
>
> I'm not really sure how to solve this or if it should be solved (read
> about the laziness aspect in "Contribution VS Maintenance" below)

I don't mean to be too dismissive of these concerns (they are common;
people have a difficult time with change) -- I've long been critical
of people concerned with their "GitHub High Score". See some writing
on this from a while ago:
http://wesmckinney.com/blog/github-open-source-contributions/

>
> # Time / Payment
> Yes, this is indeed a big issue. From what I can tell from the open
> source projects I was involved in, for large contributor crowds you
> normally have full/half-time positions in place for the core
> maintainers (look at the Mozilla projects, the Blender Foundation,
> Gnome / Red Hat). So at some point I think maintaining isn't a part
> time / hobby thing anymore (w/o downgrading the hard work of hobby
> contributors, quite the contrary). I don't have a link at hand, but I
> recall some discussion about GitHub and its importance for hiring
> (since it acts as a CV) after MS bought it, and some of the responses
> were "doing all this work in your free time is a privilege of wealthy,
> mostly-white men", which, even without endorsing the statement in this
> bare form, already shows a problem of the open source world.
>
> # Contribution VS Maintenance
> The very "nice" thing about patch/PR contribution is that you do your
> work and then you can walk away and it's the maintainer's problem to
> release the artifact, upgrade/migrate your code and ensure that the
> tests you've written never break. It's comfortable. Being a maintainer
> means the opposite of all that. And in the end, you get blamed for not
> supporting certain features (see the open source paragraph here:
> https://blog.ghost.org/5/ ) or for security disasters (remember the
> OpenSSL disaster).
>
> I think together with the previous point this means, we have to get
> companies to pay for that work, and not just dump their features to an
> OSS repo.

This is a huge problem. I have recently made some significant personal
financial sacrifices to be able to engineer an arrangement where I can
provide more scalable full-time employment opportunities for Apache
Arrow maintainers. See:
http://wesmckinney.com/blog/announcing-ursalabs/.

Particularly in the United States, full-time employment is very
important to have health care and other benefits, so the best scenario
is for companies to sponsor full-time (100%, not 20%) maintainers.
What I have seen happen all too often is that a person might start out
spending 50-80% of their time doing OSS maintenance, and at some point
they get reassigned to proprietary projects and stop doing
maintenance.

>
> # Path to Maintainership
> So I think (from my narrow point of view!) that many people expect that
> the path from "outsider" to "maintainer" takes the route over "a lot of
> patch/PR contributions". If I'm reading your mail right, that is not
> necessarily the case for Apache projects and I think that's great. The
> "review PRs" path sounds great, but I think neither GitHub nor any
> other platform I'm aware of does a good job of getting people to do
> so. I mean, I see a PR and I can leave a review, but for me it is not
> really clear which consequences this has (naturally, random people
> don't have a veto on changes). So I can jump in when I think something
> is wrong, but I cannot approve a PR. This makes sense, but it poses the
> question of "how?!". I mean, it is pretty clear how to become a
> patch/PR contributor, but it is not clear how to become a maintainer,
> at least not in an easy way. (I'm sure it's written down somewhere).

Since we just started a project wiki
(https://cwiki.apache.org/confluence/display/ARROW), I can write down
a list of all the things that I regularly do as a maintainer.

Being a "maintainer" is a project leadership role; you are a "prime
mover". It means you are doing all of the things that help the project
stay organized, move forward, and periodically make releases. I took
it upon myself to be the Arrow prime mover from the early days of the
project, but we now have a large enough user and contributor base that
it is unfair to me to continue bearing the load that I have in the
past.

>
> So, overall I think a clear Call for Action at the top of the README
> could help. Like "Hey, we're looking for maintainers, you could start
> by reviewing some PRs and after some reviews maintainers will just be
> the last gatekeeper and after some more time, you can even merge PRs on
> your own".
>
> # My personal contribution
> Triggered by this call for help, I'll try to get more involved in
> Python, C++ and Rust reviews.
>
> So, these are some thoughts that I hope may help.
>

Thanks for these comments, and I much appreciate your help!

> Thanks again for addressing this issue and your time and passion,
> Marco
>
> On 2018/06/30 14:57:42, Wes McKinney <w...@xxxxxxxxx> wrote:
>> hi folks,
>>
>> Arrow has grown by leaps and bounds over the last 2.5 years. We are
>> approaching our 2000th patch and on track to surpass 200 unique
>> contributors by year end.
>>
>> All this contribution growth is great, but it has a hidden cost: the
>> maintenance. The burden of maintaining the project: particularly
>> reviewing and merging patches, has fallen on a very small number of
>> people. From the commit logs, we can see how many patches each
>> committer has merged:
>>
>> $ git shortlog -csn d5aa7c46692474376a3c31704cfc4783c86338f2..master
>>   1289  Wes McKinney
>>    268  Uwe L. Korn
>>     74  Korn, Uwe
>>     54  Antoine Pitrou
>>     52  Julien Le Dem
>>     39  Philipp Moritz
>>     18  Kouhei Sutou
>>     18  Steven Phillips
>>     13  Bryan Cutler
>>     11  Jacques Nadeau
>>     10  Phillip Cloud
>>      8  Brian Hulette
>>      5  Robert Nishihara
>>      5  adeneche
>>      4  GitHub
>>      3  Sidd
>>      3  siddharth
>>      1  AbdelHakim Deneche
>>      1  Your Name Here
>>
>> So Uwe and I have merged ~84% of the patches in the project so far.
>> This isn't a completely accurate reflection of the maintainer burden,
>> since many others contribute to code reviews and other aspects of
>> patch maintenance, and you have to be a committer to earn a place on
>> this list.
>>
>> I'm not sure what's the best way to address this problem. The quality
>> of our code review has declined at times as we struggle to keep up
>> with the flow of patches -- I don't think this is good. Having the
>> patch queue pile up isn't great either. Personally, I'm having a
>> difficult time balancing project maintenance and patch authoring,
>> particularly in the last 6 months.
>>
>> Unfortunately, many people believe that writing patches is the primary
>> mode of contribution to an open source project. Apache projects
>> explicitly state that non-patch contributions are valued in earning
>> karma (committership and PMC membership). We're starting to have more
>> corporate contributors come out of the woodwork, and while it's great
>> for contributors to be paid to write patches for the project, they are
>> rarely given the time and space to contribute meaningfully to
>> maintenance.
>>
>> Any thoughts about how we can grow the maintainership? Somehow we need
>> to reach ~5-6 core maintainers over the next year.
>>
>> Thanks,
>> Wes
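
For anyone who wants to recompute the merge share from the quoted `git shortlog -csn` listing, here is a minimal sketch in Python. The helper names (`parse_shortlog`, `merge_share`) are hypothetical, the counts are copied from the listing as it appears in this message, and treating "Uwe L. Korn" and "Korn, Uwe" as the same person is an assumption, so the result depends on which identities you choose to combine.

```python
import re

# Counts copied from the quoted `git shortlog -csn` output in this thread.
SAMPLE = """\
  1289  Wes McKinney
   268  Uwe L. Korn
    74  Korn, Uwe
    54  Antoine Pitrou
    52  Julien Le Dem
    39  Philipp Moritz
    18  Kouhei Sutou
    18  Steven Phillips
    13  Bryan Cutler
    11  Jacques Nadeau
    10  Phillip Cloud
     8  Brian Hulette
     5  Robert Nishihara
     5  adeneche
     4  GitHub
     3  Sidd
     3  siddharth
     1  AbdelHakim Deneche
     1  Your Name Here
"""


def parse_shortlog(text):
    """Parse 'count  name' lines into a list of (count, name) pairs."""
    entries = []
    for line in text.splitlines():
        match = re.match(r"\s*(\d+)\s+(.+?)\s*$", line)
        if match:
            entries.append((int(match.group(1)), match.group(2)))
    return entries


def merge_share(entries, names):
    """Fraction of all merges attributed to the given committer names."""
    total = sum(count for count, _ in entries)
    selected = sum(count for count, name in entries if name in names)
    return selected / total if total else 0.0


entries = parse_shortlog(SAMPLE)
# Assumption: both "Korn" spellings belong to the same committer.
top_two = {"Wes McKinney", "Uwe L. Korn", "Korn, Uwe"}
print(f"share merged by top two committers: {merge_share(entries, top_two):.1%}")
```

To run this against a live checkout instead, paste the output of the same `git shortlog -csn <range>` command in place of `SAMPLE`.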