
Re: [Go] High memory usage on CSV read into table


Thanks,

I've tried the new code and that seems to have shaved about 1GB of memory
off, so the heap is about 8.84GB now. Here is the updated pprof output:
https://i.imgur.com/itOHqBf.png

It looks like the majority of allocations are in the memory.GoAllocator.

(pprof) top
Showing nodes accounting for 8.84GB, 100% of 8.84GB total
Showing top 10 nodes out of 41
      flat  flat%   sum%        cum   cum%
    4.24GB 47.91% 47.91%     4.24GB 47.91%  github.com/apache/arrow/go/arrow/memory.(*GoAllocator).Allocate
    2.12GB 23.97% 71.88%     2.12GB 23.97%  github.com/apache/arrow/go/arrow/memory.NewResizableBuffer (inline)
    1.07GB 12.07% 83.95%     1.07GB 12.07%  github.com/apache/arrow/go/arrow/array.NewData
    0.83GB  9.38% 93.33%     0.83GB  9.38%  github.com/apache/arrow/go/arrow/array.NewStringData
    0.33GB  3.69% 97.02%     1.31GB 14.79%  github.com/apache/arrow/go/arrow/array.(*BinaryBuilder).newData
    0.18GB  2.04% 99.06%     0.18GB  2.04%  github.com/apache/arrow/go/arrow/array.NewChunked
    0.07GB  0.78% 99.85%     0.07GB  0.78%  github.com/apache/arrow/go/arrow/array.NewInt64Data
    0.01GB  0.15%   100%     0.21GB  2.37%  github.com/apache/arrow/go/arrow/array.(*Int64Builder).newData
         0     0%   100%        6GB 67.91%  github.com/apache/arrow/go/arrow/array.(*BinaryBuilder).Append
         0     0%   100%     4.03GB 45.54%  github.com/apache/arrow/go/arrow/array.(*BinaryBuilder).Reserve


I'm a bit busy at the moment but I'll probably repeat the same test on the
other Arrow implementations (e.g. Java) to see if they allocate a similar
amount.


Daniel Harper
http://djhworld.github.io
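For anyone wanting to capture a heap profile like the one above, the standard
runtime/pprof package can write one after the CSV-to-table code has run (the
exact harness used for the numbers in this thread isn't shown here); a minimal
sketch:

package main

import (
	"log"
	"os"
	"runtime"
	"runtime/pprof"
)

func main() {
	// ... run the CSV -> table code under test here ...

	f, err := os.Create("heap.pprof")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// Force a GC so the profile reflects live (retained) memory only.
	runtime.GC()
	if err := pprof.WriteHeapProfile(f); err != nil {
		log.Fatal(err)
	}
	// Inspect with: go tool pprof -top heap.pprof
}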


On Mon, 19 Nov 2018 at 10:17, Sebastien Binet <binet@xxxxxxx> wrote:

> hi Daniel,
> On Sun, Nov 18, 2018 at 10:17 PM Daniel Harper <djharperuk@xxxxxxxxx>
> wrote:
>
> > Sorry just realised SVG doesn't work.
> >
> > PNG of the pprof can be found here: https://i.imgur.com/BVXv1Jm.png
> >
> >
> > Daniel Harper
> > http://djhworld.github.io
> >
> >
> > On Sun, 18 Nov 2018 at 21:07, Daniel Harper <djharperuk@xxxxxxxxx>
> > wrote:
> >
> > > I wasn't sure of the best place to discuss this, but I've noticed that
> > > when running the following piece of code
> > >
> > > https://play.golang.org/p/SKkqPWoHPPS
> > >
> > > on a CSV file that contains roughly 1 million records (about 100MB of
> > > data), the memory usage of the process leaps to about 9.1GB
> > >
> > > The records look something like this
> > >
> > >
> > >
> > > "2018-08-27T20:00:00Z","cdnA","dash","audio","http","programme-1","3577","2018","08","27","2018-08-27","live"
> > >
> > > "2018-08-27T20:00:01Z","cdnB","hls","video","https","programme-2","14","2018","08","27","2018-08-27","ondemand"
> > >
> > > I've attached a pprof output of the process.
> > >
> > > From the looks of it the heavy use of _strings_ might be where most of
> > > the memory is going.
> > >
> > > Is this expected? I'm new to the code, happy to help where possible!
> >
>
> it's somewhat expected.
>
> you use `ioutil.ReadFile` to get your data.
> this will read the whole file in memory and stick it there: so there's
> that.
> for much bigger files, I would recommend using `os.Open`.
>
> also, you don't release the individual records once passed to the table, so
> you have a memory leak.
> here is my current attempt:
> - https://play.golang.org/p/ns3GJW6Wx3T
>
> finally, as I was alluding to on the #data-science slack channel, right now
> Go arrow/csv will create a new Record for each row in the incoming CSV
> file.
> so you get a bunch of overhead for every row/record.
>
> a much more efficient way would be to chunk `n` rows into a single Record.
> an even more efficient way would be to create a dedicated csv.table type
> that implements array.Table (as it seems you're interested in using that
> interface) but only reads the incoming CSV file piecewise (ie: implementing
> the chunking I was alluding to above but w/o having to load the whole
> []Record slice.)
>
> as a first step to improve this issue, implementing chunking would already
> shave off a bunch of overhead.
>
> -s
>
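Putting the first two suggestions above together (stream the file with
`os.Open` rather than reading it whole, and release every Record once the
table holds it), a rough sketch could look like the following. The file name,
the three-column schema and the column names are made up for illustration
(the real file has 12 columns, and the schema passed to the reader needs one
field per column); the playground code itself isn't reproduced here.

package main

import (
	"log"
	"os"

	"github.com/apache/arrow/go/arrow"
	"github.com/apache/arrow/go/arrow/array"
	"github.com/apache/arrow/go/arrow/csv"
)

func main() {
	// Stream the file instead of loading it whole into memory.
	f, err := os.Open("data.csv")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// Illustrative schema only -- list every column of the real CSV here.
	schema := arrow.NewSchema([]arrow.Field{
		{Name: "timestamp", Type: arrow.BinaryTypes.String},
		{Name: "cdn", Type: arrow.BinaryTypes.String},
		{Name: "bytes", Type: arrow.PrimitiveTypes.Int64},
	}, nil)

	r := csv.NewReader(f, schema, csv.WithComma(','))
	defer r.Release()

	var recs []array.Record
	for r.Next() {
		rec := r.Record()
		rec.Retain() // the reader reuses its Record; keep our own reference
		recs = append(recs, rec)
	}
	if err := r.Err(); err != nil {
		log.Fatal(err)
	}

	tbl := array.NewTableFromRecords(schema, recs)
	defer tbl.Release()

	// The table keeps its own references, so drop ours to avoid the leak.
	for _, rec := range recs {
		rec.Release()
	}

	log.Printf("rows: %d", tbl.NumRows())
}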
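And a rough illustration of the chunking idea: build one Record per `n` rows
with a RecordBuilder instead of one Record per CSV line. This is hand-rolled
on top of encoding/csv purely to show the shape of it (it is not the arrow/csv
reader); the chunk size, the two-column schema and the column indices are
illustrative assumptions.

package main

import (
	"encoding/csv"
	"io"
	"log"
	"os"

	"github.com/apache/arrow/go/arrow"
	"github.com/apache/arrow/go/arrow/array"
	"github.com/apache/arrow/go/arrow/memory"
)

// readChunked turns every chunk of up to n CSV rows into a single Record,
// instead of one Record per row. Illustrative two-column schema only.
func readChunked(r io.Reader, n int) ([]array.Record, error) {
	mem := memory.NewGoAllocator()
	schema := arrow.NewSchema([]arrow.Field{
		{Name: "cdn", Type: arrow.BinaryTypes.String},
		{Name: "protocol", Type: arrow.BinaryTypes.String},
	}, nil)

	b := array.NewRecordBuilder(mem, schema)
	defer b.Release()

	var recs []array.Record
	rows := 0

	cr := csv.NewReader(r)
	for {
		row, err := cr.Read()
		if err == io.EOF {
			break
		}
		if err != nil {
			return nil, err
		}
		b.Field(0).(*array.StringBuilder).Append(row[1]) // cdn column
		b.Field(1).(*array.StringBuilder).Append(row[4]) // protocol column
		rows++
		if rows == n {
			recs = append(recs, b.NewRecord()) // flush chunk; builder resets
			rows = 0
		}
	}
	if rows > 0 {
		recs = append(recs, b.NewRecord()) // final partial chunk
	}
	return recs, nil
}

func main() {
	f, err := os.Open("data.csv")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	recs, err := readChunked(f, 64*1024) // e.g. 64k rows per Record
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("records: %d", len(recs))

	for _, rec := range recs {
		rec.Release()
	}
}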