Re: Peak memory usage for pyarrow.parquet.read_table


No, there is no need to pass any options when reading. They can be beneficial depending on what you want to achieve, but the defaults are fine, too.

I'm not sure whether you're able to post an example, but it would be nice if you could post the resulting Arrow schema of the table. The problem might be related to a specific type. A quick way to debug this on your side would be to read only a subset of the columns using the `columns=` argument of `read_table`. Maybe you can already pin the memory problem down to a specific column. Having these hints would make it easier for us to diagnose the underlying problem.

Uwe

On Wed, Apr 25, 2018, at 8:06 PM, Bryant Menn wrote:
> Uwe,
> 
> I am not. Should I be? I forgot to mention earlier that the Parquet file
> came from Spark/PySpark.
> 
> On Wed, Apr 25, 2018 at 1:32 PM Uwe L. Korn <uwelk@xxxxxxxxxx> wrote:
> 
> > Hello Bryant,
> >
> > are you using any options on `pyarrow.parquet.read_table` or a possible
> > `to_pandas` afterwards?
> >
> > Uwe
> >
> > On Wed, Apr 25, 2018, at 7:27 PM, Bryant Menn wrote:
> > > I tried reading a Parquet file (<200MB, lots of text with snappy) using
> > > > read_table and saw the memory usage peak over 8GB before settling back
> > > > down to ~200MB. This surprised me as I was expecting to be able to handle a
> > > Parquet file of this size with much less RAM (doing some processing with
> > > smaller VMs).
> > >
> > > > I am not sure if this is expected, but I thought I might check with everyone
> > > > here and learn something new. Poking around, it seems to be related to
> > > > ParquetReader.read_all?
> > >
> > > Thanks in advance,
> > > Bryant
> >