Re: Writing empty strings to parquet files
Thanks Sergio. If we don't have any unit tests explicitly testing
this, it would be a good idea to add some anyway.
On Fri, May 4, 2018 at 12:26 PM, <scarrascoso@xxxxxxxxxxxxx> wrote:
> Hi Uwe:
> Thanks a lot for your feedback.
> While preparing a simple example to reproduce this issue, I have been able to get the expected behavior (empty strings properly written as ‘’ in the parquet file).
> So there is actually no problem with Parquet.write_table.
> The problem was rather a bug in my process: two steps were in the wrong order, so unicode formatting was applied to None values earlier than expected, turning them into the string ‘None’.
> Again, thank you very much and apologies for the noise.
> Sergio Carrascoso
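For anyone who finds this thread later, here is a small sketch of how that kind of ordering bug can arise. The two functions are hypothetical stand-ins for the steps described above, not Sergio's actual code, and they assume the second step was meant to replace None with an empty string:

```python
def format_then_fill(values):
    # Wrong order: formatting runs first, so None becomes the text 'None'
    # and the later None check never matches.
    formatted = ["{}".format(v) for v in values]
    return ["" if v is None else v for v in formatted]


def fill_then_format(values):
    # Intended order: replace None first, then apply string formatting.
    filled = ["" if v is None else v for v in values]
    return ["{}".format(v) for v in filled]


print(format_then_fill(["a", None, ""]))  # ['a', 'None', '']
print(fill_then_format(["a", None, ""]))  # ['a', '', '']
```

In the first variant the value is already the string 'None' before anything is written to parquet, which is why it can look like a problem in the writing step.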
>> On 4 May 2018, at 10:54, Uwe L. Korn <uwelk@xxxxxxxxxx> wrote:
>> Hello Sergio,
>> this is definitely unwanted behaviour. Can you open an issue on https://issues.apache.org/jira/projects/PARQUET and provide a minimal reproducing example? There is definitely a difference between empty strings and null strings. Parquet also supports this distinction, so we should support round-tripping them.
>> On Thu, May 3, 2018, at 8:47 AM, scarrascoso@xxxxxxxxxxxxx wrote:
>>> I would like to know if there is any way in PyArrow to write empty
>>> string values to a parquet file.
>>> When I use Parquet.write_table, if any column contains empty string
>>> values, they end up as None in the parquet file.
>>> My process depends on these values being properly written as empty
>>> strings in the parquet files.
>>> To provide some context, my current workflow is the following:
>>> - Read content from json files (using Pandas.read_json)
>>> - Convert the corresponding dataframe to a PyArrow table (using
>>> - Finally, write the table to a parquet file (using Parquet.write_table)
>>> I have done some checks during the process, and the empty string values
>>> are preserved right up to the step that writes the parquet file.
>>> The options for the write_table method don't provide anything specific for
>>> this. Is this behavior (writing '' as None) an unavoidable default?
>>> Is there any other way to write the parquet files where I have more
>>> options to deal with this?
>>> Any hint or feedback will be greatly appreciated.
>>> Thanks a lot in advance, all the best.
>>> Sergio Carrascoso