Re: Writing empty strings to parquet files


Hello Sergio,

This is definitely unwanted behaviour. Could you open an issue at https://issues.apache.org/jira/projects/PARQUET and provide a minimal reproducing example? An empty string and a null string are different values, and Parquet can represent that difference, so we should support round-tripping both.

Uwe

On Thu, May 3, 2018, at 8:47 AM, scarrascoso@xxxxxxxxxxxxx wrote:
> 
> Hi:
> 
> I would like to know if there is any way in PyArrow to write empty 
> string values to a parquet file.
> When I use pyarrow.parquet.write_table, if any column contains empty 
> string values, they end up as None in the parquet file.
> My process depends on these values being properly written as empty 
> strings in the parquet files.
> 
> To provide some context, my current workflow is the following:
> 
> - Read content from json files (using pandas.read_json)
> - Convert the corresponding dataframe to a PyArrow table (using 
> pyarrow.Table.from_pandas)
> - Finally, write the table to a parquet file (using 
> pyarrow.parquet.write_table)
> 
> I have done some checks during the process, and the empty string values 
> are preserved up to the step that writes the parquet file.
> 
> The options for the write_table method don't provide anything specific 
> for this. Is this behavior (writing '' as None) an unavoidable default?
> Is there any other way to write the parquet files that gives me more 
> options to deal with this?
> 
> Any hint or feedback will be greatly appreciated.
> 
> Thanks a lot in advance, all the best.
> 
> Sergio Carrascoso
>