Re: [JAVA] Supporting zero copy arrow-vector


After digging into it a little deeper, I have more questions:

First, a vector takes an allocator. Zero copy means we should not do any
additional allocation, which implies that a dummy allocator capable of (at
most) allocating a zero-length (getEmpty) ArrowBuf is sufficient.
However, there are places in the vector code that require more allocation:

https://github.com/apache/arrow/blob/master/java/vector/src/main/java/org/apache/arrow/vector/BaseFixedWidthVector.java#L511
https://github.com/apache/arrow/blob/master/java/vector/src/main/java/org/apache/arrow/vector/BitVectorHelper.java#L180

The vector will allocate when the values are all null or all non-null. It
does seem like an optimization that can be done, but why does it reallocate
without checking whether the validity buffer is really empty? Taking the
fixed-width vector as an example, it does in fact check that the buffer count
is two, and in my simple test case I saw that the validity buffer is still
being sent in the non-null case.

Second, Arrow made a decision to support only off-heap buffers. Why? It
doesn't affect my use case, but it sounds like this could be more flexible.

Unsafe supports two sets of APIs:
getXXX(long address)
getXXX(Object obj, long offset)
So it should work with both direct (off-heap) and heap buffers:

initializing:
if (buffer.isDirect()) {
  this.ref = null;
  this.offset = getAddressOfDirect(buffer);
} else if (buffer.hasArray()) {
  this.ref = buffer.array();
  this.offset = Unsafe.ARRAY_BYTE_BASE_OFFSET + buffer.arrayOffset();
}
this.offset += buffer.position();

then call:
theUnsafe.getXXX(this.ref, this.offset)

It likely costs one more if branch compared to getXXX(address).
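
To make the idea above concrete, here is a rough sketch of a read-only view
that goes through getXXX(Object, long) for both kinds of buffer. It is not
Arrow code: the class name UnsafeBufferReader is made up for illustration, and
reading the private "address" field of java.nio.Buffer stands in for the
getAddressOfDirect helper mentioned above. That field name is an assumption
about JDK internals, and sun.misc.Unsafe itself is an unsupported API, so
treat this as an experiment rather than something to depend on.

```java
import java.lang.reflect.Field;
import java.nio.Buffer;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import sun.misc.Unsafe;

/** Hypothetical read-only view over a heap or direct ByteBuffer. */
final class UnsafeBufferReader {
    private static final Unsafe UNSAFE;
    private static final long ADDRESS_FIELD_OFFSET;

    static {
        try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            UNSAFE = (Unsafe) f.get(null);
            // Assumption: java.nio.Buffer stores the native address in a
            // private long field named "address".
            ADDRESS_FIELD_OFFSET =
                UNSAFE.objectFieldOffset(Buffer.class.getDeclaredField("address"));
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    private final ByteBuffer keepAlive; // keeps direct memory from being freed
    private final Object ref;           // null for direct memory, byte[] for heap
    private final long base;            // absolute address, or offset into the array

    UnsafeBufferReader(ByteBuffer buffer) {
        this.keepAlive = buffer;
        if (buffer.isDirect()) {
            this.ref = null;
            this.base = UNSAFE.getLong(buffer, ADDRESS_FIELD_OFFSET) + buffer.position();
        } else if (buffer.hasArray()) {
            this.ref = buffer.array();
            this.base = Unsafe.ARRAY_BYTE_BASE_OFFSET
                + buffer.arrayOffset() + buffer.position();
        } else {
            // For example, a read-only heap buffer does not expose its array.
            throw new IllegalArgumentException("buffer is neither direct nor array-backed");
        }
    }

    // Reads use native byte order and do no bounds checking.
    int getInt(long index) {
        return UNSAFE.getInt(ref, base + index);
    }

    long getLong(long index) {
        return UNSAFE.getLong(ref, base + index);
    }

    public static void main(String[] args) {
        ByteBuffer heap = ByteBuffer.allocate(16).order(ByteOrder.nativeOrder());
        heap.putInt(0, 42);
        ByteBuffer direct = ByteBuffer.allocateDirect(16).order(ByteOrder.nativeOrder());
        direct.putInt(0, 42);
        System.out.println(new UnsafeBufferReader(heap).getInt(0));
        System.out.println(new UnsafeBufferReader(direct).getInt(0));
    }
}
```

In this sketch the read path does not branch on the buffer type at all: a null
ref makes Unsafe treat the offset as an absolute address. The only branch is
at construction time. The keepAlive field is there because, with ref set to
null, nothing else would stop a direct buffer from being collected while the
view is still in use.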

Have a nice weekend!

On Fri, Sep 7, 2018 at 1:33 PM Zhenyuan Zhao <zzymtn@xxxxxxxxx> wrote:

> Thanks. That's crystal clear for me now.
>
> On Fri, Sep 7, 2018 at 1:16 PM Jacques Nadeau <jacques@xxxxxxxxxx> wrote:
>
>> I opened a jira to describe what I think needs to be done here. Check it
>> out:
>>
>> https://issues.apache.org/jira/browse/ARROW-3191
>>
>>
>> On Fri, Sep 7, 2018 at 10:47 AM Wes McKinney <wesmckinn@xxxxxxxxx> wrote:
>>
>> > Seems like you should be able to construct an UnsafeDirectByteBuf from
>> > a MappedByteBuffer, and then wrap that with UnsafeDirectLittleEndian
>> > to get zero-copy access to a memory map. Does that sound right?
>> >
>> >
>> >
>> https://github.com/netty/netty/blob/4.1/buffer/src/main/java/io/netty/buffer/UnpooledUnsafeDirectByteBuf.java
>> > On Fri, Sep 7, 2018 at 12:46 PM Zhenyuan Zhao <zzymtn@xxxxxxxxx> wrote:
>> > >
>> > > Interesting, so basically I can still use the public constructor
>> > >
>> > > public ArrowBuf(AtomicInteger refCnt, BufferLedger ledger,
>> > > UnsafeDirectLittleEndian byteBuf, BufferManager manager,
>> > > ArrowByteBufAllocator alloc, int offset, int length, boolean isEmpty)
>> > >
>> > > Instead, override BufferLedger/UnsafeDirectLittleEndian/BufferManager
>> to
>> > > make it reference existing buffer. That is a much more plausible
>> option
>> > as
>> > > it will reuse the Vectors. All I need is to implement my own
>> > deserializer.
>> > > Did I get you right?
>> > >
>> > > Thanks
>> > >
>> > > On Fri, Sep 7, 2018 at 7:09 AM Jacques Nadeau <jacques@xxxxxxxxxx>
>> > wrote:
>> > >
>> > > > It is on purpose that the ArrowBuf is final. It is done to ensure a
>> > single
>> > > > impl and performance reasons. ArrowBuf is primarily a memory address
>> > and a
>> > > > length and wants zero indirection to the reading/writing of that.
>> > > >
>> > > > It does, however, wrap several types of substructures as long as
>> they
>> > have
>> > > > that property. For example, an ArrowBuf almost always currently
>> wraps a
>> > > > Netty UnsafeDirectLittleEndian object. At that level you could
>> propose
>> > a
>> > > > way to wrap more types of memory addresses+lengths.
>> > > >
>> > > > On Thu, Sep 6, 2018, 10:26 PM Zhenyuan Zhao <zzymtn@xxxxxxxxx>
>> wrote:
>> > > >
>> > > > > Hello Team,
>> > > > >
>> > > > > I'm working on using arrow as intermediate format for transferring
>> > > > columnar
>> > > > > data from server to client. In this case, the client will only
>> need
>> > to
>> > > > read
>> > > > > from the format so I would like to avoid any unnecessary copy of
>> the
>> > > > data.
>> > > > > Looking into arrow, while arrow-format/flatbuffers does support
>> zero
>> > > > copy,
>> > > > > current arrow-vector java implementation is not. I was trying to
>> hack
>> > > > zero
>> > > > > copy for readonly scenarios, but saw two main blockers:
>> > > > >
>> > > > >    1.
>> > > > >
>> > > > >    ArrowBuf is the only buffer implementation used exclusively
>> across
>> > > > >    ArrowReader/ArrowRecordBatch/Vectors. It's final, which means
>> > there
>> > > > > isn't a
>> > > > >    way for me to override its logic in order to wrap some existing
>> > > > buffer.
>> > > > >    It's absolutely necessary to use ArrowBuf for write scenarios
>> due
>> > to
>> > > > > buffer
>> > > > >    allocation, but for read, I was hoping vector can just serve as
>> > view
>> > > > on
>> > > > > top
>> > > > >    of existing memory buffer (like java ByteBuffer or netty
>> ByteBuf).
>> > > > Seems
>> > > > >    safe for read only case.
>> > > > >    2.
>> > > > >
>> > > > >    As a result of #1
>> > described
>> > > > >    above, the only layer which seems reusable is the arrow-format.
>> > Then I
>> > > > > have
>> > > > >    to implement effectively a readonly copy of arrow-vector that
>> > > > references
>> > > > >    existing buffer. Put aside the effort doing that, it
>> introduces a
>> > big
>> > > > > gap
>> > > > >    to keep up with future changes/fixes made to arrow-vector.
>> > > > >
>> > > > > Wondering if you guys have put any thoughts into such readonly
>> > scenarios.
>> > > > > Any suggestion how I can approach this myself?
>> > > > >
>> > > > > Thanks
>> > > > >
>> > > >
>> >
>>
>