Re: [Go] High memory usage on CSV read into table


That seems buggy, then. There should only be about 4.125 bytes of overhead
per string value on average (a 32-bit offset plus one validity bit).
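
As a rough sanity check on that figure, here is a minimal Go sketch of the
arithmetic. It assumes the 112M file size is an upper bound on the raw string
bytes, 12 string columns per row (counted from the sample records), and values
packed into a few large arrays instead of one Record per row.
`estimateStringColumnsBytes` is a hypothetical helper, not part of the Arrow API.

```go
package main

import "fmt"

// estimateStringColumnsBytes returns a rough estimate, in bytes, for holding
// rows*cols string values in Arrow-style string arrays: the raw character
// data plus one 32-bit offset and one validity bit per value.
// (Hypothetical helper; it ignores padding and the extra trailing offset.)
func estimateStringColumnsBytes(rows, cols, dataBytes int64) int64 {
	values := rows * cols
	offsets := values * 4        // one 32-bit offset per value
	validity := (values + 7) / 8 // one bit per value, rounded up to bytes
	return dataBytes + offsets + validity
}

func main() {
	const (
		rows      = 1007481   // from `wc -l` in the thread
		cols      = 12        // fields in the sample records
		dataBytes = 112 << 20 // 112M file size, upper bound on string data
	)
	total := estimateStringColumnsBytes(rows, cols, dataBytes)
	fmt.Printf("estimated: %.1f MiB\n", float64(total)/(1<<20))
}
```

Under those assumptions the estimate stays well below 1GB, which suggests the
multi-GB heap comes from somewhere other than the string data itself.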
On Mon, Nov 19, 2018 at 5:02 PM Daniel Harper <djharperuk@xxxxxxxxx> wrote:
>
> Uncompressed
>
> $ ls -la concurrent_streams.csv
> -rw-r--r-- 1 danielharper 112M Nov 16 19:21 concurrent_streams.csv
>
> $ wc -l concurrent_streams.csv
>  1007481 concurrent_streams.csv
>
>
> Daniel Harper
> http://djhworld.github.io
>
>
> On Mon, 19 Nov 2018 at 21:55, Wes McKinney <wesmckinn@xxxxxxxxx> wrote:
>
> > I'm curious how the file is only 100MB if it's producing ~6GB of
> > strings in memory. Is it compressed?
> > On Mon, Nov 19, 2018 at 4:48 PM Daniel Harper <djharperuk@xxxxxxxxx>
> > wrote:
> > >
> > > Thanks,
> > >
> > > I've tried the new code and that seems to have shaved about 1GB of memory
> > > off, so the heap is about 8.84GB now, here is the updated pprof output
> > > https://i.imgur.com/itOHqBf.png
> > >
> > > It looks like the majority of allocations are in the memory.GoAllocator
> > >
> > > (pprof) top
> > > Showing nodes accounting for 8.84GB, 100% of 8.84GB total
> > > Showing top 10 nodes out of 41
> > > >       flat  flat%   sum%        cum   cum%
> > > >     4.24GB 47.91% 47.91%     4.24GB 47.91%  github.com/apache/arrow/go/arrow/memory.(*GoAllocator).Allocate
> > > >     2.12GB 23.97% 71.88%     2.12GB 23.97%  github.com/apache/arrow/go/arrow/memory.NewResizableBuffer (inline)
> > > >     1.07GB 12.07% 83.95%     1.07GB 12.07%  github.com/apache/arrow/go/arrow/array.NewData
> > > >     0.83GB  9.38% 93.33%     0.83GB  9.38%  github.com/apache/arrow/go/arrow/array.NewStringData
> > > >     0.33GB  3.69% 97.02%     1.31GB 14.79%  github.com/apache/arrow/go/arrow/array.(*BinaryBuilder).newData
> > > >     0.18GB  2.04% 99.06%     0.18GB  2.04%  github.com/apache/arrow/go/arrow/array.NewChunked
> > > >     0.07GB  0.78% 99.85%     0.07GB  0.78%  github.com/apache/arrow/go/arrow/array.NewInt64Data
> > > >     0.01GB  0.15%   100%     0.21GB  2.37%  github.com/apache/arrow/go/arrow/array.(*Int64Builder).newData
> > > >          0     0%   100%        6GB 67.91%  github.com/apache/arrow/go/arrow/array.(*BinaryBuilder).Append
> > > >          0     0%   100%     4.03GB 45.54%  github.com/apache/arrow/go/arrow/array.(*BinaryBuilder).Reserve
> > >
> > >
> > > > I'm a bit busy at the moment but I'll probably repeat the same test on the
> > > > other Arrow implementations (e.g. Java) to see if they allocate a similar
> > > > amount.
> > >
> > >
> > > Daniel Harper
> > > http://djhworld.github.io
> > >
> > >
> > > On Mon, 19 Nov 2018 at 10:17, Sebastien Binet <binet@xxxxxxx> wrote:
> > >
> > > > hi Daniel,
> > > > On Sun, Nov 18, 2018 at 10:17 PM Daniel Harper <djharperuk@xxxxxxxxx>
> > > > wrote:
> > > >
> > > > > Sorry just realised SVG doesn't work.
> > > > >
> > > > > PNG of the pprof can be found here: https://i.imgur.com/BVXv1Jm.png
> > > > >
> > > > >
> > > > > Daniel Harper
> > > > > http://djhworld.github.io
> > > > >
> > > > >
> > > > > > On Sun, 18 Nov 2018 at 21:07, Daniel Harper <djharperuk@xxxxxxxxx> wrote:
> > > > >
> > > > > > > I wasn't sure where the best place to discuss this was, but I've
> > > > > > > noticed that when running the following piece of code
> > > > > > >
> > > > > > > https://play.golang.org/p/SKkqPWoHPPS
> > > > > > >
> > > > > > > on a CSV file that contains roughly 1 million records (about 100MB
> > > > > > > of data), the memory usage of the process leaps to about 9.1GB.
> > > > > >
> > > > > > The records look something like this
> > > > > >
> > > > > >
> > > > > >
> > > > >
> > > >
> > "2018-08-27T20:00:00Z","cdnA","dash","audio","http","programme-1","3577","2018","08","27","2018-08-27","live"
> > > > > >
> > > > > >
> > > > >
> > > >
> > "2018-08-27T20:00:01Z","cdnB","hls","video","https","programme-2","14","2018","08","27","2018-08-27","ondemand"
> > > > > >
> > > > > > I've attached a pprof output of the process.
> > > > > >
> > > > > > > From the looks of it the heavy use of _strings_ might be where most
> > > > > > > of the memory is going.
> > > > > > >
> > > > > > > Is this expected? I'm new to the code, happy to help where possible!
> > > > >
> > > >
> > > > it's somewhat expected.
> > > >
> > > > > you use `ioutil.ReadFile` to get your data.
> > > > > this will read the whole file in memory and stick it there: so there's
> > > > > that.
> > > > > for much bigger files, I would recommend using `os.Open`.
> > > > >
> > > > > also, you don't release the individual records once passed to the table,
> > > > > so you have a memory leak.
> > > > here is my current attempt:
> > > > - https://play.golang.org/p/ns3GJW6Wx3T
> > > >
> > > > > finally, as I was alluding to on the #data-science slack channel, right
> > > > > now Go arrow/csv will create a new Record for each row in the incoming
> > > > > CSV file.
> > > > > so you get a bunch of overhead for every row/record.
> > > > >
> > > > > a much more efficient way would be to chunk `n` rows into a single
> > > > > Record.
> > > > > an even more efficient way would be to create a dedicated csv.table type
> > > > > that implements array.Table (as it seems you're interested in using that
> > > > > interface) but only reads the incoming CSV file piecewise (ie:
> > > > > implementing the chunking I was alluding to above but w/o having to load
> > > > > the whole []Record slice.)
> > > > >
> > > > > as a first step to improve this issue, implementing chunking would
> > > > > already shave off a bunch of overhead.
> > > >
> > > > -s
> > > >
> >