Re: [Format] Pointer types / span types


I see what you're saying. I was thinking about how the span indices
relate to data split across record batches -- if you had a shared
"reference" array, it could be treated like a dictionary: if span
indices in different record batches reference the same array, then
that array could be sent in a dictionary batch.
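
As a rough illustration of that idea, here is a minimal sketch in plain
Python (not the Arrow API). The names `resolve_spans`, `reference`,
`batch_1_spans`, and `batch_2_spans` are hypothetical, and the
half-open (start, stop) convention is an assumption rather than
anything agreed on in this thread.

```python
from typing import List, Sequence, Tuple

# (start, stop) with stop exclusive; this convention is an assumption.
Span = Tuple[int, int]


def resolve_spans(reference: Sequence, spans: Sequence[Span]) -> List[list]:
    """Materialize each (start, stop) span as a slice of the shared array."""
    out = []
    for start, stop in spans:
        if not (0 <= start <= stop <= len(reference)):
            raise IndexError(f"span ({start}, {stop}) is out of bounds")
        out.append(list(reference[start:stop]))
    return out


# Shared "reference" array, sent once (analogous to a dictionary batch).
reference = [10, 20, 30, 40, 50, 60]

# Span indices split across two record batches, both pointing into the
# same reference array. Unlike list offsets, spans may overlap.
batch_1_spans = [(0, 2), (1, 4)]
batch_2_spans = [(3, 6), (5, 5)]

batch_1 = resolve_spans(reference, batch_1_spans)
batch_2 = resolve_spans(reference, batch_2_spans)
```

Both batches resolve against the same `reference`, so it only needs to
be transmitted once, much as a dictionary is shared by the record
batches that index into it.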

On Wed, May 2, 2018 at 5:03 PM, Brian Hulette <brian.hulette@xxxxxxxx> wrote:
> List also references another (data) array which can be a different size, but
> rather than requiring it to be represented with a second schema, we make it
> a child of the List type. We could do the same thing for a Span type, and
> give it a new type of buffer that contains start/stop indices rather than
> offsets.
>
> To Antoine's point, maybe there's not enough demand to justify defining this
> type right now. I definitely agree that it would be good to see an example
> dataset before adding something like this.
>
> Brian
>
>
> On 05/02/2018 03:54 PM, Wes McKinney wrote:
>>>
>>> Perhaps that could be an argument for making span a core logical type?
>>
>> I think if anything, this argues that it should not be. Because "span"
>> references another array, which can be a different size, you need two
>> schemas to be able to make sense of it.
>>
>> In either case, I would be interested to see what modifications would
>> be proposed to Schema.fbs and an example dataset described with such a
>> schema (that is a single array, instead of multiple -- i.e. a
>> non-composite representation).
>>
>> For the record, if there are sufficiently common "composite" data
>> representations, I don't see a problem with developing community
>> standards based on the building blocks we already have.
>>
>> - Wes
>>
>> On Wed, May 2, 2018 at 3:38 PM, Brian Hulette <brian.hulette@xxxxxxxx> wrote:
>>>
>>> If this were accomplished at the application level, how would it work with
>>> the IPC formats? I'd think you'd need to have two separate files (or
>>> streams), since array 1 and array 2 will be different lengths. Perhaps that
>>> could be an argument for making span a core logical type?
>>>
>>> Brian
>>>
>>>
>>>
>>> On 05/02/2018 03:34 PM, Antoine Pitrou wrote:
>>>>
>>>> On Wed, 2 May 2018 10:12:37 -0400
>>>> Wes McKinney <wesmckinn@xxxxxxxxx> wrote:
>>>>>
>>>>> It sounds like the "span" type could be implemented as a composite of
>>>>> multiple Arrow arrays / schemas:
>>>>>
>>>>> array 1 (data)
>>>>> any schema
>>>>>
>>>>> array 2 (view)
>>>>> struct <
>>>>>     start: int64,
>>>>>     stop: int64
>>>>> >
>>>>>
>>>>> Unless I'm missing something, this feels like an application-level
>>>>> concern rather than something that needs to be addressed in the
>>>>> columnar format / metadata.
>>>>
>>>> Well, couldn't the same theoretically be said about list arrays?
>>>> In the end, I suppose it all depends whether there's enough demand to
>>>> make it a core logical type inside Arrow, rather than something people
>>>> write custom code for in their application.
>>>>
>>>> Regards
>>>>
>>>> Antoine.
>>>
>>>
>