
[jira] [Created] (ARROW-2515) Errors with DictionaryArray inside of ListArray or other DictionaryArray

Brent Kerby created ARROW-2515:

             Summary: Errors with DictionaryArray inside of ListArray or other DictionaryArray
                 Key: ARROW-2515
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 0.9.0
            Reporter: Brent Kerby

An exception ("KeyError: 26") is raised when .as_py() is called on elements of a ListArray over a DictionaryArray, or of a DictionaryArray whose dictionary is itself a DictionaryArray. Here are a couple of tests that currently fail:

import pyarrow as pa

def test_dictionary_array_1():
    dict_arr = pa.DictionaryArray.from_arrays([0, 1, 0], ['a', 'b'])
    list_arr = pa.ListArray.from_arrays([0, 2, 3], dict_arr)
    assert list_arr.to_pylist() == [['a', 'b'], ['a']]

def test_dictionary_array_2():
    dict_arr = pa.DictionaryArray.from_arrays([0, 1, 0], ['a', 'b'])
    dict_arr2 = pa.DictionaryArray.from_arrays([0, 1, 2, 1, 0], dict_arr)
    assert dict_arr2.to_pylist() == ['a', 'b', 'a', 'b', 'a']

The problem appears to be that the function box_scalar in scalar.pxi does not handle dictionary arrays, because there is currently no DictionaryValue type.
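
The reported error is what a plain lookup table keyed by type id would produce if it had no entry for the dictionary type. The following is a toy model in plain Python, not the actual scalar.pxi code: the table name, the box() helper, and the string type id are hypothetical, and it assumes that 26 is the type id of the dictionary type.

```python
# Toy model of a type-id dispatch table; NOT the real pyarrow code.
TYPE_STRING = 13      # assumed id, used here only for illustration
TYPE_DICTIONARY = 26  # assumed to be the id behind "KeyError: 26"

# Hypothetical table mapping a type id to a function that boxes a raw value.
_boxers = {
    TYPE_STRING: lambda raw: str(raw),
    # No entry for TYPE_DICTIONARY, mirroring the missing DictionaryValue type.
}


def box(type_id, raw):
    # A plain dict lookup raises KeyError for any type id without an entry.
    return _boxers[type_id](raw)


try:
    box(TYPE_DICTIONARY, 0)
    error = None
except KeyError as exc:
    error = exc

print(repr(error))
```

Under that assumption, adding an entry for the dictionary type is the kind of change that would make the lookup succeed.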


DictionaryArray.__getitem__ currently works around the lack of a DictionaryValue type by dereferencing the index and constructing a scalar from the value in the underlying dictionary. In other words, if we have a dictionary with int8 indices and string values, then the result of __getitem__ will be a StringValue (rather than a DictionaryValue). This works in simple cases but not in the more complex scenarios illustrated above.

I have a patch ready that adds a DictionaryValue type similar to the other scalar types, resolving these bugs and removing the need for a special-cased implementation of DictionaryArray.__getitem__. This DictionaryValue would have two accessor properties, "indices_value" and "dictionary_value", giving access to both the index into the dictionary and the looked-up value. DictionaryValue.as_py() would then simply call .as_py() on the underlying dictionary_value.
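
To make the proposed behavior concrete, here is a minimal sketch in plain Python. It is not the patch itself and does not use pyarrow: ToyDictionaryArray is a hypothetical stand-in for DictionaryArray, indices are plain ints rather than scalar objects, and only the property names indices_value and dictionary_value come from the description above.

```python
class DictionaryValue:
    """Toy scalar for one element of a ToyDictionaryArray (illustration only)."""

    def __init__(self, array, i):
        self._array = array
        self._i = i

    @property
    def indices_value(self):
        # The index into the dictionary stored at this position.
        return self._array.indices[self._i]

    @property
    def dictionary_value(self):
        # The looked-up value; itself a DictionaryValue when dictionaries nest.
        return self._array.dictionary[self.indices_value]

    def as_py(self):
        # Delegate to the looked-up value, as the proposal describes.
        value = self.dictionary_value
        return value.as_py() if hasattr(value, "as_py") else value


class ToyDictionaryArray:
    """Hypothetical stand-in for DictionaryArray; not the pyarrow class."""

    def __init__(self, indices, dictionary):
        self.indices = list(indices)
        self.dictionary = dictionary  # a list or another ToyDictionaryArray

    def __len__(self):
        return len(self.indices)

    def __getitem__(self, i):
        # No special-casing: always return a DictionaryValue.
        return DictionaryValue(self, i)

    def to_pylist(self):
        return [self[i].as_py() for i in range(len(self))]


inner = ToyDictionaryArray([0, 1, 0], ["a", "b"])
outer = ToyDictionaryArray([0, 1, 2, 1, 0], inner)
print(outer.to_pylist())
```

Because as_py() delegates to dictionary_value, a dictionary nested inside another dictionary resolves recursively without __getitem__ needing to know the value type.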
