Re: [Go] High memory usage on CSV read into table


On Mon, Nov 19, 2018 at 11:29 PM Wes McKinney <wesmckinn@xxxxxxxxx> wrote:

> That seems buggy then. There is only 4.125 bytes of overhead per
> string value on average (a 32-bit offset, plus a valid bit)
> On Mon, Nov 19, 2018 at 5:02 PM Daniel Harper <djharperuk@xxxxxxxxx>
> wrote:
> >
> > Uncompressed
> >
> > $ ls -la concurrent_streams.csv
> > -rw-r--r-- 1 danielharper 112M Nov 16 19:21 concurrent_streams.csv
> >
> > $ wc -l concurrent_streams.csv
> >  1007481 concurrent_streams.csv
> >
> >
> > Daniel Harper
> > http://djhworld.github.io
> >
> >
> > On Mon, 19 Nov 2018 at 21:55, Wes McKinney <wesmckinn@xxxxxxxxx> wrote:
> >
> > > I'm curious how the file is only 100MB if it's producing ~6GB of
> > > strings in memory. Is it compressed?
> > > On Mon, Nov 19, 2018 at 4:48 PM Daniel Harper <djharperuk@xxxxxxxxx>
> > > wrote:
> > > >
> > > > Thanks,
> > > >
> > > > I've tried the new code and that seems to have shaved about 1GB of
> > > > memory off, so the heap is about 8.84GB now, here is the updated
> > > > pprof output
> > > > https://i.imgur.com/itOHqBf.png
> > > >
> > > > It looks like the majority of allocations are in the memory.GoAllocator
> > > >
> > > > (pprof) top
> > > > Showing nodes accounting for 8.84GB, 100% of 8.84GB total
> > > > Showing top 10 nodes out of 41
> > > >       flat  flat%   sum%        cum   cum%
> > > >     4.24GB 47.91% 47.91%     4.24GB 47.91%  github.com/apache/arrow/go/arrow/memory.(*GoAllocator).Allocate
> > > >     2.12GB 23.97% 71.88%     2.12GB 23.97%  github.com/apache/arrow/go/arrow/memory.NewResizableBuffer (inline)
> > > >     1.07GB 12.07% 83.95%     1.07GB 12.07%  github.com/apache/arrow/go/arrow/array.NewData
> > > >     0.83GB  9.38% 93.33%     0.83GB  9.38%  github.com/apache/arrow/go/arrow/array.NewStringData
> > > >     0.33GB  3.69% 97.02%     1.31GB 14.79%  github.com/apache/arrow/go/arrow/array.(*BinaryBuilder).newData
> > > >     0.18GB  2.04% 99.06%     0.18GB  2.04%  github.com/apache/arrow/go/arrow/array.NewChunked
> > > >     0.07GB  0.78% 99.85%     0.07GB  0.78%  github.com/apache/arrow/go/arrow/array.NewInt64Data
> > > >     0.01GB  0.15%   100%     0.21GB  2.37%  github.com/apache/arrow/go/arrow/array.(*Int64Builder).newData
> > > >          0     0%   100%        6GB 67.91%  github.com/apache/arrow/go/arrow/array.(*BinaryBuilder).Append
> > > >          0     0%   100%     4.03GB 45.54%  github.com/apache/arrow/go/arrow/array.(*BinaryBuilder).Reserve
> > > >
> > > >
> > > > I'm a bit busy at the moment but I'll probably repeat the same test on
> > > > the other Arrow implementations (e.g. Java) to see if they allocate a
> > > > similar amount.
>

I've implemented chunking over there:

- https://github.com/apache/arrow/pull/3019

Could you try with a couple of chunking values?
e.g.:
- csv.WithChunk(-1): reads the whole file into memory, creates one big record
- csv.WithChunk(nrows/10): creates 10 records (nrows being the total number of rows in the file)

Also, it would be great to try to disentangle the memory usage of the "CSV
reading part" from the "Table creation" one:
- have some perf numbers without storing all these Records into a []Record slice,
- have some perf numbers with only storing these Records into a []Record slice,
- have some perf numbers with storing the Records into the slice + creating the Table.

hth,
-s