
Re: Peak memory usage for pyarrow.parquet.read_table


Uwe,

I am not. Should I be? I forgot to mention earlier that the Parquet file
came from Spark/PySpark.

On Wed, Apr 25, 2018 at 1:32 PM Uwe L. Korn <uwelk@xxxxxxxxxx> wrote:

> Hello Bryant,
>
> are you using any options on `pyarrow.parquet.read_table` or a possible
> `to_pandas` afterwards?
>
> Uwe
>
> On Wed, Apr 25, 2018, at 7:27 PM, Bryant Menn wrote:
> > I tried reading a Parquet file (<200MB, lots of text with snappy) using
> > read_table and saw the memory usage peak over 8GB before settling back
> down
> > to ~200MB. This surprised me as I was expecting to be able to handle a
> > Parquet file of this size with much less RAM (doing some processing with
> > smaller VMs).
> >
> > I am not sure if this is expected, but I thought I might check with
> > everyone here and learn something new. Poking around, it seems to be
> > related to ParquetReader.read_all?
> >
> > Thanks in advance,
> > Bryant
>