Writing empty strings to parquet files


Hi:

I would like to know if there is any way in PyArrow to write empty string values to a parquet file.
When I use pyarrow.parquet.write_table and a column contains empty string values, they end up as None in the parquet file.
My process depends on these values being written as empty strings in the parquet files.

To provide some context, my current workflow is the following:

- Read content from json files (using pandas.read_json)
- Convert the corresponding dataframe to a PyArrow table (using pyarrow.Table.from_pandas)
- Finally, write the table to a parquet file (using pyarrow.parquet.write_table)

I have done some checks during the process, and the empty string values are preserved at every step up to the one that writes the parquet file.

The options of the write_table method don't seem to offer anything specific for this. Is this behavior (writing '' as None) an unavoidable default?
Is there any other way to write the parquet files that gives me more options to deal with this?

Any hint or feedback will be greatly appreciated.

Thanks a lot in advance, all the best.

Sergio Carrascoso