Re: Writing empty strings to parquet files


Thanks Sergio. If we don't have any unit tests explicitly testing
this, it would be a good idea to add some anyway.

- Wes

On Fri, May 4, 2018 at 12:26 PM,  <scarrascoso@xxxxxxxxxxxxx> wrote:
> Hi Uwe:
>
> Thanks a lot for your feedback.
>
> While preparing a simple example to reproduce this issue, I have been able to get the expected behavior (empty strings properly written as ‘’ in the parquet file).
> So actually there’s no problem with Parquet.write_table.
>
> The problem was rather a bug where two steps in my process were in the wrong order, so unicode formatting was being applied to None values earlier than expected, turning them into the string ‘None’.
>
> Again, thank you very much and apologies for the noise.
>
> Best,
>
> Sergio Carrascoso
>
>> On 4 May 2018, at 10:54, Uwe L. Korn <uwelk@xxxxxxxxxx> wrote:
>>
>> Hello Sergio,
>>
>> this is definitely unwanted behaviour. Can you open an issue on https://issues.apache.org/jira/projects/PARQUET and provide a minimal reproducing example? There is definitely a difference between empty strings and null strings. Parquet also supports this distinction, so we should support round-tripping both.
>>
>> Uwe
>>
>> On Thu, May 3, 2018, at 8:47 AM, scarrascoso@xxxxxxxxxxxxx wrote:
>>>
>>> Hi:
>>>
>>> I would like to know if there is any way in PyArrow to write empty
>>> string values to a parquet file.
>>> When I use Parquet.write_table, if any column contains empty string
>>> values, they end up as None in the parquet file.
>>> My process depends on these values being properly written as empty
>>> strings in the parquet files.
>>>
>>> To provide some context, my current workflow is the following:
>>>
>>> - Read content from json files (using Pandas.read_json)
>>> - Convert the corresponding dataframe to a PyArrow table (using
>>> PyArrow.Table.from_pandas)
>>> - Finally, write the table to a parquet file (using Parquet.write_table)
>>>
>>> I have done some checks during the process, and the empty string values
>>> are being honored until the writing step to a parquet file.
>>>
>>> The options for the write_table method don't provide anything specific
>>> for this. Is this behavior (writing '' as None) an unavoidable default?
>>> Is there any other way to write the parquet files where I have more
>>> options to deal with this?
>>>
>>> Any hint or feedback will be greatly appreciated.
>>>
>>> Thanks a lot in advance, all the best.
>>>
>>> Sergio Carrascoso
>>>
>
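
The pitfall described in the thread (None turning into the string 'None' once text formatting is applied to it) can be reproduced in plain Python. The sketch below is an assumption about the general shape of the problem, not Sergio's actual code, which the thread does not show; both function names are hypothetical.

```python
def format_naive(value):
    """Format every value as text, including None (the pitfall)."""
    return "{}".format(value)


def format_null_safe(value):
    """Format a value as text but pass None through untouched."""
    return None if value is None else "{}".format(value)


raw = ["a", "", None]

# Naive formatting turns the null into the four-character string 'None'.
naive = [format_naive(v) for v in raw]          # ['a', '', 'None']

# Null-safe formatting keeps '' and None distinct.
null_safe = [format_null_safe(v) for v in raw]  # ['a', '', None]
```

Once a null has become the literal text 'None', no later step can tell it apart from real data, so the null handling has to happen before any formatting.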