Re: PyArrow and Parquet DELTA_BINARY_PACKED


Hello Feras,

At the moment, `DELTA_BINARY_PACKED` is only implemented on the read path in parquet-cpp. The encoder needed to write this encoding has not been implemented yet.

The change in file size is something I don't understand either. The only difference between the two versions is that in version 1 we encode uint32 columns as INT64, whereas in version 2 we can encode them as UINT32, a type that was not available in version 1. It would be nice if you could narrow down the issue, e.g. to the column that causes the increase in size. You might also use the Java parquet-tools or parquet-cli to inspect the size statistics of the individual parts of the Parquet file.

Uwe

On Fri, May 11, 2018, at 3:07 AM, Feras Salim wrote:
> Hi, I was wondering if I'm missing something or currently the
> `DELTA_BINARY_PACKED` is only available for reading when it comes to
> parquet files, I can't find a way for the writer to encode timestamp data
> with `DELTA_BINARY_PACKED`, furthermore I seem to get about 10% increase in
> final file size when I change from ver 1 to ver 2 without changing anything
> else about the schema or data.