[Java] tracking null values optimally


Hi,

As I pointed out in my previous email, the C++ code has an optimization for
the cases where (i) there are no null values, or (ii) all values are null.
The Java code path does not have it, and I am trying to implement this
feature. It would look something like:

public int isSet(int index) {
    if (nullCount == valueCount) {
        return 0;
    } else if (nullCount == 0) {
        return 1;
    } else {
        final int byteIndex = index >> 3;
        final byte b = validityBuffer.getByte(byteIndex);
        final int bitIndex = index & 7;
        return (b >> bitIndex) & 0x01;
    }
}

The current problem is that "nullCount" is not explicitly tracked in the
Java code. It is computed on demand by calling

public int getNullCount() {
    return BitVectorHelper.getNullCount(validityBuffer, valueCount);
}

which recounts the bits in the validity buffer on each call, so it is too
expensive to call every time in isSet(). I see that there is a TODO about
this in the source code at
https://github.com/apache/arrow/blob/master/java/vector/src/main/java/org/apache/arrow/vector/BaseFixedWidthVector.java#L75
which says: "Right now BaseValueVector is the top level base class for
other vector types in ValueVector hierarchy (non-nullable) and those
vectors have not yet been refactored/removed so moving things to the top
class as of now is not a good idea."

(1) I am not sure what this means. Can someone explain why it is not a good
idea?
(2) I think there is another branch, AbstractContainerVector, which does
not have BaseValueVector as its top-level base class.
AbstractContainerVector implements ValueVector (which is an interface)
directly.

In the C++ code, data and bitmap are both stored in the top-level Array
class, which is probably not possible in the Java implementation. However,
we can move the bitmap operations to the "BaseValueVector" class. I don't
know what to do about the AbstractContainerVector path. Perhaps some code
needs to be duplicated there.

(3) Is this the right design choice? Any inputs?
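
To make the question concrete, here is a minimal sketch of the kind of
bookkeeping I have in mind. It is not Arrow code: the class name
TrackedValidity and its methods setNull/setNotNull are hypothetical, and a
plain byte[] stands in for the validity buffer. The only point is that
nullCount is updated whenever a validity bit flips, so that isSet() can take
the two fast paths without recounting.

```java
/** Hypothetical stand-in for a vector's validity bitmap; not an Arrow class. */
final class TrackedValidity {
    private final byte[] validity;  // one bit per value, 1 = non-null
    private final int valueCount;
    private int nullCount;          // maintained incrementally

    TrackedValidity(int valueCount) {
        this.valueCount = valueCount;
        this.validity = new byte[(valueCount + 7) >> 3];
        this.nullCount = valueCount; // all bits start at 0, i.e. all null
    }

    /** Marks the value as non-null; the counter changes only if the bit flips. */
    void setNotNull(int index) {
        final int byteIndex = index >> 3;
        final int mask = 1 << (index & 7);
        if ((validity[byteIndex] & mask) == 0) {
            validity[byteIndex] |= mask;
            nullCount--;
        }
    }

    /** Marks the value as null; the counter changes only if the bit flips. */
    void setNull(int index) {
        final int byteIndex = index >> 3;
        final int mask = 1 << (index & 7);
        if ((validity[byteIndex] & mask) != 0) {
            validity[byteIndex] &= ~mask;
            nullCount++;
        }
    }

    /** Constant time: no scan of the bitmap. */
    int getNullCount() {
        return nullCount;
    }

    /** Returns 1 if the value is non-null, 0 if it is null. */
    int isSet(int index) {
        if (nullCount == valueCount) {
            return 0;               // all values are null
        }
        if (nullCount == 0) {
            return 1;               // no value is null
        }
        return (validity[index >> 3] >> (index & 7)) & 0x01;
    }
}
```

The catch is that this only works if every write to the bitmap goes through
such methods. Any code path that writes the validity buffer directly would
have to update or recompute the counter.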

Thanks,
--
Animesh