git.net

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

How to model massive nested data


Hello Arrow folks,

I've been skimming through the Arrow docs and code trying to figure out how
one might model nested data structures where the nested portions themselves
might be massive (i.e., larger than available memory). AFAICT, the nesting
constructs in Arrow appear to assume that you can always fit an entire
single record in memory. Am I right?

Regardless, is there a recommended way of handling this use case? The thing
we want to be able to model is essentially (in Java): Iterator<Pair<T1,
Iterator<T2>>. So each row is a single T1 value associated with an
arbitrarily large list of T2 values.

I could imagine perhaps flattening the hierarchy down into a schema that's
essentially Pair<T1, T2>, especially if the T1 and T2 values can be
optional. So say I had two rows with T1 values of A and B and T2 lists of
[1, 2] and [3] respectively (i.e., [<A, [1, 2]>, <B, [3]>]), then you could
just have rows and columns like this:

T1     | T2
---------------
A      | <null>
<null> | 1
<null> | 2
B      | <null>
<null> | 3

And then you'd presumably need to write wrapper code on top of Arrow to
marshal all of this under an appropriate set of Interfaces.

Is there a good way to handle this use case in Arrow as it exists today? If
not, do you have a sense for how hard would it be to add support for
something like this more natively?

-Tyler