Re: Sync Call Notes
Personally, I am not really in favor of ABI stability in the short
term, for a few reasons:
* We don't have enough maintainers as it is to keep up with the
development flow in the project.
* It may harm forward progress in the project's design. Because
the development team is so small and there are so few maintainers,
there has not been a great deal of feedback on the general factoring
of the C++ code. When the size of the development team grows, it would
be valuable to be able to revisit design decisions based on feedback
from new contributors who have yet to join the project.
Basically, many ABI decisions have been made hurriedly and I think we
need the flexibility to fix our mistakes while the project is growing.
I think it would be more valuable to develop shared / reusable build
infrastructure to better accommodate an evolving ABI so that
rebuilding packages is not too onerous for downstream dependencies. In
large companies like Google that maintain monorepos, this problem is
solved by requiring all call sites associated with an ABI to be fixed
all at once. We probably won't be able to create a monorepo for all
projects that use Arrow, but we could, for example, make Turbodbc
package rebuilds easier.
In summary, until the Arrow developer group grows significantly
larger, I think we should expect the users of these libraries to "live
at HEAD". I do think we should make ABI changes transparent and
well-documented so the pain is minimized. For the moment, we still
have a lot of development work to do for more people to "care" about
Apache Arrow and invest in its success long term.
On Thu, Apr 19, 2018 at 1:38 PM, Antoine Pitrou <antoine@xxxxxxxxxx> wrote:
> Hi Uwe,
> On 19/04/2018 at 18:42, Uwe L. Korn wrote:
>>> 1) are we ok with paying the cost of pimpls? (mostly the indirection
>>> cost I guess, and the fact that we can't have inline methods/accessors)
>> I'm not sure how much of the cost we're ready to pay. There is a clear benefit to keeping a stable ABI (the NumPy people do this fantastically): you can do patch releases without consumers worrying about whether they need to rebuild their binaries.
>> The indirection on paths that call expensive functions is certainly no problem, i.e. if you have a table and select a column, this is an operation you don't do often, thus I think the overhead is acceptable. On the other hand, accessing the null_count or the length of an array is definitely an operation that is performed quite often. These should be as fast as possible.
>> I cannot give you a definite answer yet; once I have the time, I'll try to implement and profile some of the possible approaches.
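For concreteness, the trade-off in question 1 can be sketched roughly as follows. This is only an illustration under stated assumptions: PimplArray and InlineArray are hypothetical names, not Arrow classes, and the whole thing is collapsed into one file, whereas in a real library the Impl definition and the out-of-line accessors would live in a .cc file.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>

// Hypothetical classes for illustration only; these are not Arrow's types.

// Variant A: pimpl. The public class holds a single pointer, so fields can
// be added to Impl later without changing the layout that clients compiled
// against.
class PimplArray {
 public:
  PimplArray(int64_t length, int64_t null_count);
  ~PimplArray();

  // Declared here, defined out of line: client code cannot inline these,
  // so every call is a function call plus a pointer dereference.
  int64_t length() const;
  int64_t null_count() const;

 private:
  struct Impl;  // Layout is hidden from client code.
  std::unique_ptr<Impl> impl_;
};

// Variant B: plain members with inline accessors. Calls are as cheap as a
// field load, but the member layout is part of the ABI: adding or
// reordering a field requires every consumer to rebuild.
class InlineArray {
 public:
  InlineArray(int64_t length, int64_t null_count)
      : length_(length), null_count_(null_count) {}

  int64_t length() const { return length_; }
  int64_t null_count() const { return null_count_; }

 private:
  int64_t length_;
  int64_t null_count_;
};

// Everything below would normally sit in the library's .cc file.
struct PimplArray::Impl {
  int64_t length;
  int64_t null_count;
};

PimplArray::PimplArray(int64_t length, int64_t null_count)
    : impl_(new Impl{length, null_count}) {
  assert(null_count <= length);
}

// Defined here, where Impl is a complete type, so unique_ptr can delete it.
PimplArray::~PimplArray() = default;

int64_t PimplArray::length() const { return impl_->length; }
int64_t PimplArray::null_count() const { return impl_->null_count; }
```

Both variants give the same answers for length() and null_count(); what differs is what the client's compiled code depends on. Profiling a tight loop over these two accessors would be one way to put a number on the indirection cost.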
>>> 2) what do we do about things like ArrayData, which seems publicly
>>> exposed by design?
>> ArrayData is marked as internal and thus I would feel ok to break its ABI between non-major releases. If people really depend on its usage, then we should think of a clear way to make it public / non-internal.
> Perhaps we need a three-tiered approach?
> 1) a public and stable namespace ("arrow") with the goal to reach ABI
> stability post-1.0;
> 2) a public but still moving namespace ("arrow::unstable"?) where we
> generally try not to remove existing functionality and to honor API
> compatibility, but do not guarantee any sort of ABI stability;
> (this could have ArrayData, PrimitiveArray...)
> 3) an internal-use namespace ("arrow::internal"), which third-party
> projects can use at their own risk.
> (this should get all our internal helpers, including almost all CPython