Re: Writing empty strings to parquet files


Hi Wes:

Thanks for your message.

I would say that both test_pandas_parquet_1_0_rountrip and test_pandas_parquet_2_0_rountrip (in arrow/python/pyarrow/tests/test_parquet.py) already test this.
Sorry I didn’t realize this sooner.

All the best,

Sergio Carrascoso

> On 5 May 2018, at 01:31, Wes McKinney <wesmckinn@xxxxxxxxx> wrote:
> 
> Thanks Sergio. If we don't have any unit tests explicitly testing
> this, it would be a good idea to add some anyway.
> 
> - Wes
> 
> On Fri, May 4, 2018 at 12:26 PM,  <scarrascoso@xxxxxxxxxxxxx> wrote:
>> Hi Uwe:
>> 
>> Thanks a lot for your feedback.
>> 
>> While preparing a simple example to reproduce this issue, I have been able to get the expected behavior (empty strings properly written as ‘’ in the parquet file).
>> So there is actually no problem with Parquet.write_table.
>> 
>> The problem was a bug in my own process: two steps were in the wrong order, so unicode formatting was applied to None values earlier than expected, turning them into the string ‘None’.
>> 
>> Again, thank you very much and apologies for the noise.
>> 
>> Best,
>> 
>> Sergio Carrascoso
>> 
>>> On 4 May 2018, at 10:54, Uwe L. Korn <uwelk@xxxxxxxxxx> wrote:
>>> 
>>> Hello Sergio,
>>> 
>>> this is definitely unwanted behaviour. Can you open an issue on https://issues.apache.org/jira/projects/PARQUET and provide a minimal reproducing example? There is definitely a difference between empty strings and null strings. Parquet also supports this distinction, so we should support roundtripping them.
>>> 
>>> Uwe
>>> 
>>> On Thu, May 3, 2018, at 8:47 AM, scarrascoso@xxxxxxxxxxxxx wrote:
>>>> 
>>>> Hi:
>>>> 
>>>> I would like to know if there is any way in PyArrow to write empty
>>>> string values to a parquet file.
>>>> When I use Parquet.write_table, if any column contains empty string
>>>> values, they end up as None in the parquet file.
>>>> My process depends on these values being properly written as empty
>>>> strings in the parquet files.
>>>> 
>>>> To provide some context, my current workflow is the following:
>>>> 
>>>> - Read content from json files (using Pandas.read_json)
>>>> - Convert the corresponding dataframe to a PyArrow table (using
>>>> PyArrow.Table.from_pandas)
>>>> - Finally, write the table to a parquet file (using Parquet.write_table)
>>>> 
>>>> I have done some checks during the process, and the empty string values
>>>> are being honored until the writing step to a parquet file.
>>>> 
>>>> The options for the write_table method don't provide anything specific
>>>> for this. Is this behavior (writing '' as None) an unavoidable default?
>>>> Is there any other way to write the parquet files where I have more
>>>> options to deal with this?
>>>> 
>>>> Any hint or feedback will be greatly appreciated.
>>>> 
>>>> Thanks a lot in advance, all the best.
>>>> 
>>>> Sergio Carrascoso
>>>> 
>> 



