Re: Peak memory usage for pyarrow.parquet.read_table


Uwe,

I am not. Should I be? I forgot to mention earlier that the Parquet file
came from Spark/PySpark.
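
For reference, the call with no options looks roughly like this (a minimal
sketch; the file path is a placeholder, not from the thread):

    import pyarrow.parquet as pq

    # Read the whole file into an Arrow Table, then convert to pandas.
    # With no options, every column is read and decompressed at once.
    table = pq.read_table("data.parquet")
    df = table.to_pandas()  # a second, pandas-side copy exists during conversion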

On Wed, Apr 25, 2018 at 1:32 PM Uwe L. Korn <uwelk@xxxxxxxxxx> wrote:

> Hello Bryant,
>
> are you using any options on `pyarrow.parquet.read_table` or on a
> possible `to_pandas` call afterwards?
>
> Uwe
>
> On Wed, Apr 25, 2018, at 7:27 PM, Bryant Menn wrote:
> > I tried reading a Parquet file (<200MB, lots of text with snappy) using
> > read_table and saw the memory usage peak over 8GB before settling back
> > down to ~200MB. This surprised me as I was expecting to be able to handle a
> > Parquet file of this size with much less RAM (doing some processing with
> > smaller VMs).
> >
> > I am not sure if this is expected, but I thought I might check with everyone
> > here and learn something new. Poking around, it seems to be related to
> > ParquetReader.read_all?
> >
> > Thanks in advance,
> > Bryant
>
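
A snappy-compressed file full of text can expand many times over once
decompressed and materialized, which may explain part of the gap between
the 200MB file and the 8GB peak. One common way to cap peak memory,
assuming the file was written with more than one row group (Spark
typically writes several), is to read it row group by row group instead
of all at once. This is a hedged sketch of that pattern, not something
suggested in the thread; "data.parquet" is again a placeholder:

    import pyarrow.parquet as pq

    pf = pq.ParquetFile("data.parquet")
    for i in range(pf.num_row_groups):
        # Each call materializes only one row group as an Arrow Table,
        # so peak memory is bounded by the largest row group rather than
        # by the whole file.
        chunk = pf.read_row_group(i)
        # process `chunk` here (e.g. chunk.to_pandas()), then drop the
        # reference so the memory can be released before the next iteration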

