
[jira] [Created] (ARROW-3138) 'Couldn't deserialize thrift' error when reading large binary column

Jeremy Heffner created ARROW-3138:

             Summary: 'Couldn't deserialize thrift' error when reading large binary column
                 Key: ARROW-3138
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 0.10.0
         Environment: Ubuntu 16.04; Python 3.6; Pandas 0.23.4; Numpy 1.14.3 
            Reporter: Jeremy Heffner

We've run into issues reading Parquet files that contain long binary columns (utf8 strings). In particular, we hit the problem while generating WKT representations of polygons roughly 34 million characters long.

The attached example generates a dataframe with one record and one column containing a random string with 10^7 characters.

Pandas (using the default pyarrow engine) writes the file successfully, but fails when reading it back:
ArrowIOError Traceback (most recent call last)
<ipython-input-25-25d21204cbad> in <module>()
----> 1 df_read_in = pd.read_parquet('test.parquet')

~/anaconda3/envs/uda/lib/python3.6/site-packages/pandas/io/ in read_parquet(path, engine, columns, **kwargs)
287 impl = get_engine(engine)
--> 288 return, columns=columns, **kwargs)

~/anaconda3/envs/uda/lib/python3.6/site-packages/pandas/io/ in read(self, path, columns, **kwargs)
129 kwargs['use_pandas_metadata'] = True
130 result = self.api.parquet.read_table(path, columns=columns,
--> 131 **kwargs).to_pandas()
132 if should_close:
133 try:

~/anaconda3/envs/uda/lib/python3.6/site-packages/pyarrow/ in read_table(source, columns, nthreads, metadata, use_pandas_metadata)
1044 fs = _get_fs_from_path(source)
1045 return fs.read_parquet(source, columns=columns, metadata=metadata,
-> 1046 use_pandas_metadata=use_pandas_metadata)
1048 pf = ParquetFile(source, metadata=metadata)

~/anaconda3/envs/uda/lib/python3.6/site-packages/pyarrow/ in read_parquet(self, path, columns, metadata, schema, nthreads, use_pandas_metadata)
175 filesystem=self)
176 return, nthreads=nthreads,
--> 177 use_pandas_metadata=use_pandas_metadata)
179 def open(self, path, mode='rb'):

~/anaconda3/envs/uda/lib/python3.6/site-packages/pyarrow/ in read(self, columns, nthreads, use_pandas_metadata)
896 partitions=self.partitions,
897 open_file_func=open_file,
--> 898 use_pandas_metadata=use_pandas_metadata)
899 tables.append(table)

~/anaconda3/envs/uda/lib/python3.6/site-packages/pyarrow/ in read(self, columns, nthreads, partitions, open_file_func, file, use_pandas_metadata)
459 table = reader.read_row_group(self.row_group, **options)
460 else:
--> 461 table =**options)
463 if len(self.partition_keys) > 0:

~/anaconda3/envs/uda/lib/python3.6/site-packages/pyarrow/ in read(self, columns, nthreads, use_pandas_metadata)
150 columns, use_pandas_metadata=use_pandas_metadata)
151 return self.reader.read_all(column_indices=column_indices,
--> 152 nthreads=nthreads)
154 def scan_contents(self, columns=None, batch_size=65536):

~/anaconda3/envs/uda/lib/python3.6/site-packages/pyarrow/_parquet.pyx in pyarrow._parquet.ParquetReader.read_all()

~/anaconda3/envs/uda/lib/python3.6/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowIOError: Couldn't deserialize thrift: No more data to read.
Deserializing page header failed.

This message was sent by Atlassian JIRA