Re: [Py] writing 2- or 4-byte decimal columns to Parquet


Wes & Phillip, thank you both for doing some investigating.  Really interesting
-- afaict I should be taking advantage of the column width shrinking, but
the error messages I'm seeing from Redshift Spectrum suggest otherwise.
More info: https://github.com/hellonarrativ/spectrify/issues/14

I'm probably doing something silly; hopefully I can help improve the docs
at least :) Probably best to continue the conversation in the Spectrify
issue until there's more info.

Best,
Colin




On Thu, Apr 19, 2018 at 9:54 AM, Phillip Cloud <cpcloud@xxxxxxxxx> wrote:

> That's right. Shrinking happens here:
> https://github.com/apache/parquet-cpp/blob/master/src/parquet/arrow/writer.cc#L808-L809
>
> On Thu, Apr 19, 2018 at 9:40 AM Wes McKinney <wesmckinn@xxxxxxxxx> wrote:
>
> > We do "shrink" the input 128-bit decimals to the smallest number of
> > bytes that fits, though, is that right?
> >
> >
> > https://github.com/apache/parquet-cpp/blob/c405bf36506ec584e8009a6d53349277e600467d/src/parquet/arrow/schema.cc#L635
> >
> > On Thu, Apr 19, 2018 at 8:09 AM, Phillip Cloud <cpcloud@xxxxxxxxx> wrote:
> > > Hi Colin,
> > >
> > > Only 128-bit decimal writing is supported right now. Feel free to open a
> > > JIRA about this.
> > >
> > > On Wed, Apr 18, 2018, 19:10 Wes McKinney <wesmckinn@xxxxxxxxx> wrote:
> > >
> > >> hi Colin,
> > >>
> > >> Phillip Cloud is the expert on this topic, but I believe we only
> > >> support writing decimals to FIXED_LEN_BYTE_ARRAY physical type in
> > >> Parquet right now
> > >>
> > >>
> > >>
> > >> https://github.com/apache/parquet-cpp/blob/master/src/parquet/arrow/writer.cc#L798
> > >>
> > >> The size of the type depends on the decimal precision, so if we can
> > >> write to 32- or 64-bit, then we do that. Writing to INT32 or INT64
> > >> would be more complicated and require some work in parquet-cpp
> > >>
> > >> - Wes
> > >>
> > >> On Wed, Apr 18, 2018 at 7:04 PM, Colin Nichols <colin@xxxxxxxxxxxx> wrote:
> > >> > Hi all,
> > >> >
> > >> > Any thoughts on the below?  I did a little more code browsing and I'm not
> > >> > sure this is supported right now. Should I open a Jira ticket?
> > >> >
> > >> > - Colin
> > >> >
> > >> > On Tue, Apr 17, 2018 at 11:11 PM, Colin Nichols <colin@xxxxxxxxxxxx> wrote:
> > >> >
> > >> >> Hi there,
> > >> >>
> > >> >> I know (py)arrow has the decimal128() type, and using this type it's easy
> > >> >> to take an array of Python Decimals, convert to a pa.array, and write out
> > >> >> to Parquet.
> > >> >>
> > >> >> In the absence (afaict) of decimal32 and decimal64 types, is it possible
> > >> >> to go from an array of Decimals (with compatible precision/scale) and write
> > >> >> them to a parquet column of 32- or 64-bit width?
> > >> >>
> > >> >> Relevant parquet spec -- https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#decimal
> > >> >>
> > >> >> I'm looking to add this functionality to the project Spectrify, as AWS
> > >> >> Redshift Spectrum will not query unnecessarily-wide DECIMAL columns --
> > >> >> https://github.com/hellonarrativ/spectrify/issues/14
> > >> >>
> > >> >> Thanks,
> > >> >> Colin
> > >> >>
> > >> >>
> > >>
> >
>
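
For anyone reading this thread later: the "shrinking" described in the quoted replies (the FIXED_LEN_BYTE_ARRAY width depends on the decimal precision) can be illustrated with a small stdlib-only Python sketch. This is not the parquet-cpp code. The function names min_bytes_for_precision and encode_unscaled are hypothetical, and the sketch only assumes what the Parquet DECIMAL spec states: the unscaled integer is stored as big-endian two's complement.

```python
from decimal import Decimal


def min_bytes_for_precision(precision: int) -> int:
    """Smallest byte width whose signed two's-complement range can hold
    every unscaled value with `precision` decimal digits.
    (Hypothetical helper, not a parquet-cpp or pyarrow API.)"""
    if precision < 1:
        raise ValueError("precision must be at least 1")
    max_unscaled = 10 ** precision - 1
    num_bytes = 1
    # A signed integer of n bytes tops out at 2**(8n - 1) - 1.
    while 2 ** (8 * num_bytes - 1) - 1 < max_unscaled:
        num_bytes += 1
    return num_bytes


def encode_unscaled(value: Decimal, scale: int, num_bytes: int) -> bytes:
    """Encode `value` at the given scale as a big-endian two's-complement
    byte string of exactly `num_bytes` bytes.
    (Hypothetical helper for illustration only.)"""
    if not value.is_finite():
        raise ValueError("NaN and infinity cannot be encoded")
    sign, digits, exponent = value.as_tuple()
    shift = exponent + scale
    if shift < 0:
        raise ValueError("value has more fractional digits than scale")
    magnitude = int("".join(map(str, digits))) * 10 ** shift
    unscaled = -magnitude if sign else magnitude
    # Raises OverflowError if the value does not fit in num_bytes.
    return unscaled.to_bytes(num_bytes, byteorder="big", signed=True)


if __name__ == "__main__":
    for precision in (9, 18, 38):
        print(precision, min_bytes_for_precision(precision))
    print(encode_unscaled(Decimal("123.45"), scale=2, num_bytes=4).hex())
```

Under these assumptions a precision of 9 fits in 4 bytes and a precision of 18 fits in 8 bytes, which matches the 32- and 64-bit widths mentioned above. The physical type in this sketch is still a fixed-length byte array, not INT32 or INT64.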

