[jira] [Created] (ARROW-2503) Trailing space character in RowGroup statistics of pyarrow.parquet.ParquetFile

Julius Neuffer created ARROW-2503:

             Summary: Trailing space character in RowGroup statistics of pyarrow.parquet.ParquetFile
                 Key: ARROW-2503
             Project: Apache Arrow
          Issue Type: Bug
    Affects Versions: 0.9.0
            Reporter: Julius Neuffer

When reading a Parquet file that contains a string column, the _RowGroup_ statistics for that column have a trailing space character appended to the min and max values. The example below reproduces the behavior:
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# create and write arrow table as parquet
df = pd.DataFrame({'string_column': ['some', 'string', 'values', 'here']})
table = pa.Table.from_pandas(df)
pq.write_table(table, 'example.parquet')

# read parquet file metadata and print string column statistics
pq_file = pq.ParquetFile(open('example.parquet', 'rb'))
print(pq_file.metadata.row_group(0).column(0).statistics.max) # yields b'values '
print(pq_file.metadata.row_group(0).column(0).statistics.min) # yields b'here '
I did not observe this problem for other data types, even though the statistics are always returned as strings.
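
To check for the discrepancy programmatically rather than by eye, the reported statistics can be compared against the min and max computed from the data itself. This is only a minimal sketch: the helper name trailing_space_mismatch is hypothetical (not part of pyarrow), and the reported values are hard-coded from the output shown above instead of being read from the file.

```python
def trailing_space_mismatch(reported, expected):
    """Return True if `reported` is `expected` followed only by extra spaces.

    `reported` is the bytes value from the row group statistics;
    `expected` is the true value, given as str or bytes.
    """
    if isinstance(expected, str):
        expected = expected.encode("utf-8")
    return reported != expected and reported.rstrip(b" ") == expected


# Column values from the example above.
values = ['some', 'string', 'values', 'here']

# Statistics as printed in the example above (pyarrow 0.9.0), hard-coded here.
reported_max = b'values '
reported_min = b'here '

print(trailing_space_mismatch(reported_max, max(values)))
print(trailing_space_mismatch(reported_min, min(values)))
```

Note that the helper cannot tell an added space apart from a genuine one if the true value itself ends in a space, so it is only meant for test data like the above.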

When reading the same file with _fastparquet_, there is no trailing space character, which suggests that the problem lies in the read path of _pyarrow.parquet_ rather than in the written file. I am aware that this might well be an issue in _parquet-cpp_, but since I encountered the bug as a _pyarrow_ user, I am reporting it here.

I'll try to investigate this further and report back here.
