[jira] [Created] (ARROW-3909) [Python] Table.from_pandas call that seemingly should zero copy does not


Wes McKinney created ARROW-3909:
-----------------------------------

             Summary: [Python] Table.from_pandas call that seemingly should zero copy does not
                 Key: ARROW-3909
                 URL: https://issues.apache.org/jira/browse/ARROW-3909
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
            Reporter: Wes McKinney
             Fix For: 0.12.0


While doing some performance testing, I noticed that a {{Table.from_pandas}} call that ought to be zero-copy (and therefore essentially free) was taking about 50 ms:

{code}
import pandas as pd
import pyarrow as pa
import numpy as np

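K = 1000
N = 50000000
# A single column of 50 million integers with no nulls
df = pd.DataFrame({'ints': np.tile(np.arange(K), N // K)})
# Expected to wrap the existing NumPy memory rather than copy it
table = pa.Table.from_pandas(df)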
{code}

Timing the call, I see:

{code}
In [14]: timeit table = pa.Table.from_pandas(df)
51.9 ms ± 751 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
{code}

I haven't determined what's going on (is it counting nulls?), and my initial attempts to get a flame graph produced a bunch of "unknown" entries.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)