Re: Writing empty strings to parquet files


Hi Uwe:

Thanks a lot for your feedback.

While preparing a simple example to reproduce this issue, I got the expected behavior: empty strings are properly written as '' in the parquet file.
So there is actually no problem with Parquet.write_table.

The problem was a bug in my own process: two steps were in the wrong order, so unicode formatting was applied to None values earlier than expected, turning them into the string 'None'.
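
The effect of such an ordering mistake can be sketched in plain Python. The two step functions below are hypothetical stand-ins, since the actual steps of the process are not shown in this thread; the point is only that once formatting has run, a None is indistinguishable from the text 'None'.

```python
raw = ["a", "", None]


def format_text(values):
    # Hypothetical formatting step: render every value as text.
    return ["{}".format(v) for v in values]


def find_missing(values):
    # Hypothetical null-handling step: flag the entries that are None.
    return [v is None for v in values]


# Wrong order: formatting runs first, so None has already become 'None'
# by the time the null check looks at the data.
formatted_first = format_text(raw)
missing_after_format = find_missing(formatted_first)

# Intended order: record the nulls first, then keep them as None
# instead of their formatted text.
missing_first = find_missing(raw)
formatted_after = [
    None if is_missing else text
    for is_missing, text in zip(missing_first, format_text(raw))
]

print(formatted_first)       # ['a', '', 'None']
print(missing_after_format)  # [False, False, False]
print(formatted_after)       # ['a', '', None]
```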

Again, thank you very much and apologies for the noise.

Best,

Sergio Carrascoso

> On 4 May 2018, at 10:54, Uwe L. Korn <uwelk@xxxxxxxxxx> wrote:
> 
> Hello Sergio,
> 
> this is definitely unwanted behaviour. Can you open an issue on https://issues.apache.org/jira/projects/PARQUET and provide a minimal reproducing example? There is definitely a difference between empty strings and null strings. Parquet also supports this distinction, so we should support round-tripping them.
> 
> Uwe
> 
> On Thu, May 3, 2018, at 8:47 AM, scarrascoso@xxxxxxxxxxxxx wrote:
>> 
>> Hi:
>> 
>> I would like to know if there is any way in PyArrow to write empty 
>> string values to a parquet file.
>> When I use Parquet.write_table, if any column contains empty string 
>> values, they end up as None in the parquet file.
>> My process depends on these values being properly written as empty 
>> strings in the parquet files.
>> 
>> To provide some context, my current workflow is the following:
>> 
>> - Read content from json files (using Pandas.read_json)
>> - Convert the corresponding dataframe to a PyArrow table (using 
>> PyArrow.Table.from_pandas)
>> - Finally, write the table to a parquet file (using Parquet.write_table)
>> 
>> I have done some checks during the process, and the empty string values 
>> are being honored until the writing step to a parquet file.
>> 
>> The options for the write_table method don't provide anything specific for 
>> this. Is this behavior (writing '' as None) an unavoidable default?
>> Is there any other way to write the parquet files where I have more 
>> options to deal with this?
>> 
>> Any hint or feedback will be greatly appreciated.
>> 
>> Thanks a lot in advance, all the best.
>> 
>> Sergio Carrascoso
>>