Re: Peak memory usage for pyarrow.parquet.read_table


Uwe,

I'll try pinpointing things further with `columns=` and try to reproduce
what I find with data I can share.

Thanks for the pointer.

-Bryant

On Wed, Apr 25, 2018 at 2:10 PM Uwe L. Korn <uwelk@xxxxxxxxxx> wrote:

> No, there is no need to pass any options on reading. Sometimes they are
> beneficial depending on what you want to achieve but defaults are ok, too.
>
> I'm not sure if you're able to post an example but it would be nice if you
> could post the resulting Arrow schema from the table. It might be related
> to a specific type. A quick way to debug this on your side would also be to
> read only a subset of columns using the `columns=` argument of
> read_table. Maybe you can already pinpoint the memory problems to a
> specific column. Having these hints would make it easier for us to diagnose
> what the underlying problem is.
>
> Uwe
>
> On Wed, Apr 25, 2018, at 8:06 PM, Bryant Menn wrote:
> > Uwe,
> >
> > I am not. Should I be? I forgot to mention earlier that the Parquet file
> > came from Spark/PySpark.
> >
> > On Wed, Apr 25, 2018 at 1:32 PM Uwe L. Korn <uwelk@xxxxxxxxxx> wrote:
> >
> > > Hello Bryant,
> > >
> > > are you using any options on `pyarrow.parquet.read_table` or a possible
> > > `to_pandas` afterwards?
> > >
> > > Uwe
> > >
> > > On Wed, Apr 25, 2018, at 7:27 PM, Bryant Menn wrote:
> > > > I tried reading a Parquet file (<200MB, lots of text with snappy) using
> > > > read_table and saw the memory usage peak over 8GB before settling back
> > > > down to ~200MB. This surprised me as I was expecting to be able to
> > > > handle a Parquet file of this size with much less RAM (doing some
> > > > processing with smaller VMs).
> > > >
> > > > I am not sure if this is expected, but I thought I might check with
> > > > everyone here and learn something new. Poking around, it seems to be
> > > > related to ParquetReader.read_all?
> > > >
> > > > Thanks in advance,
> > > > Bryant
> > >
>

