
Re: Intro to pandas + pyarrow integration?

Ok, interesting. Thanks Wes, that does make it clear.

For other readers, this GitHub issue is related:

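Also, to make the string-storage point in Wes's reply below concrete, here's a rough stdlib-only sketch. The byte counts are CPython-specific approximations and the sample data is made up; it only illustrates the relative overhead of one-Python-object-per-string versus a contiguous Arrow-style buffer:

```python
import sys

# pandas (classically) stores strings as a NumPy object array: one
# 8-byte pointer per slot, each pointing at a full Python str object.
# Arrow instead packs the UTF-8 bytes contiguously plus a small
# offsets array. Rough estimate of each layout's footprint:
strings = ["apple", "banana", "cherry"] * 1000

# object-array layout: a full Python object per string, plus a pointer per slot
object_layout = sum(sys.getsizeof(s) for s in strings) + 8 * len(strings)

# Arrow-style layout: contiguous UTF-8 data plus one 4-byte offset per value
arrow_layout = sum(len(s.encode("utf-8")) for s in strings) + 4 * (len(strings) + 1)

print(object_layout, arrow_layout)  # the object layout is several times larger
```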
On 7/6/18, 10:25 AM, "Wes McKinney" <wesmckinn@xxxxxxxxx> wrote:

>hi Alex,
>
>One of the goals of Apache Arrow is to define an open standard for
>in-memory columnar data (which may be called "tables" or "data frames"
>in some domains). Among other things, the Arrow columnar format is
>optimized for memory efficiency and analytical processing performance
>on very large (even larger-than-RAM) data sets.
>
>The way to think about it is that pandas has its own in-memory
>representation for columnar data, but it is "proprietary" to pandas.
>To make use of pandas's analytical facilities, you must convert data
>to pandas's memory representation. As an example, pandas represents
>strings as NumPy arrays of Python string objects, which is very
>wasteful. Uwe Korn recently demonstrated an approach to using Arrow
>inside pandas, but this would require a lot of work to port algorithms
>to run against Arrow:
>
>We are working to develop the standard data frame type operations as
>reusable libraries within this project, and these will run natively
>against the Arrow columnar format. This is a big project; we would
>love to have you involved with the effort. One of the reasons I have
>spent so much of my time the last few years on this project is that I
>believe it is the best path to build a faster, more efficient
>pandas-like library for data scientists.
>
>On Fri, Jul 6, 2018 at 1:05 PM, Alex Buchanan <buchanae@xxxxxxxx> wrote:
>> Hello all.
>> I'm confused about the current level of integration between pandas and pyarrow. Am I correct in understanding that currently I'll need to convert pyarrow Tables to pandas DataFrames in order to use most of the pandas features? By "pandas features" I mean everyday slicing and dicing of data: merge, filtering, melt, spread, etc.
>> I have a dataset that starts out as small files (< 1 GB) but quickly explodes into dozens of gigabytes of memory as a pandas DataFrame. I'm interested in whether Arrow can provide a better, optimized dataframe.
>> Thanks.