[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: How to model massive nested data

Is it also possible to iterate over the iterator<T2> more then once.
Can I have multiple iterators at different positions for iterator<T2> all
working independently?

On Thu, May 10, 2018 at 12:22 PM Tyler Akidau <takidau@xxxxxxxxxx> wrote:

> Hello Arrow folks,
> I've been skimming through the Arrow docs and code trying to figure out
> how one might model nested data structures where the nested portions
> themselves might be massive (i.e., larger than available memory). AFAICT,
> the nesting constructs in Arrow appear to assume that you can always fit an
> entire single record in memory. Am I right?
> Regardless, is there a recommended way of handling this use case? The
> thing we want to be able to model is essentially (in Java):
> Iterator<Pair<T1, Iterator<T2>>. So each row is a single T1 value
> associated with an arbitrarily large list of T2 values.
> I could imagine perhaps flattening the hierarchy down into a schema that's
> essentially Pair<T1, T2>, especially if the T1 and T2 values can be
> optional. So say I had two rows with T1 values of A and B and T2 lists of
> [1, 2] and [3] respectively (i.e., [<A, [1, 2]>, <B, [3]>]), then you could
> just have rows and columns like this:
> T1     | T2
> ---------------
> A      | <null>
> <null> | 1
> <null> | 2
> B      | <null>
> <null> | 3
> And then you'd presumably need to write wrapper code on top of Arrow to
> marshal all of this under an appropriate set of Interfaces.
> Is there a good way to handle this use case in Arrow as it exists today?
> If not, do you have a sense for how hard would it be to add support for
> something like this more natively?
> -Tyler