Re: [Go] High memory usage on CSV read into table
On Sun, Nov 18, 2018 at 10:17 PM Daniel Harper <djharperuk@xxxxxxxxx> wrote:
> Sorry just realised SVG doesn't work.
> PNG of the pprof can be found here: https://i.imgur.com/BVXv1Jm.png
> Daniel Harper
> On Sun, 18 Nov 2018 at 21:07, Daniel Harper <djharperuk@xxxxxxxxx> wrote:
> > Wasn't sure where the best place to discuss this, but I've noticed that
> > when running the following piece of code
> > https://play.golang.org/p/SKkqPWoHPPS
> > On a CSV file that contains roughly 1 million records (about 100MB of
> > data), the memory usage of the process leaps to about 9.1GB.
> > The records look something like this
> > I've attached a pprof output of the process.
> > From the looks of it, the heavy use of _strings_ might be where most of
> > the memory is going.
> > Is this expected? I'm new to the code, happy to help where possible!
it's somewhat expected.
you use `ioutil.ReadFile` to get your data.
this reads the whole file into memory and keeps it there: so there's that.
for much bigger files, I would recommend `os.Open` and letting the csv reader stream from the file instead.
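something along these lines (just a sketch: the file name and the 2-column schema are placeholders, I don't know your actual layout):

```go
package main

import (
	"log"
	"os"

	"github.com/apache/arrow/go/arrow"
	"github.com/apache/arrow/go/arrow/csv"
)

func main() {
	// open the file and let the csv reader stream from it,
	// instead of slurping the whole thing with ioutil.ReadFile.
	f, err := os.Open("records.csv") // placeholder file name
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// placeholder schema: adapt to your actual columns.
	schema := arrow.NewSchema([]arrow.Field{
		{Name: "id", Type: arrow.PrimitiveTypes.Int64},
		{Name: "name", Type: arrow.BinaryTypes.String},
	}, nil)

	r := csv.NewReader(f, schema)
	defer r.Release()

	for r.Next() {
		rec := r.Record() // only valid until the next call to Next
		_ = rec           // ... use rec ...
	}
	if err := r.Err(); err != nil {
		log.Fatal(err)
	}
}
```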
also, you don't release the individual records once passed to the table, so
you have a memory leak.
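ie: if you stash the records from the reader to build a table, you need to Retain each one (the reader releases them as it advances) and Release your references once the table holds them. roughly, continuing the sketch above (and assuming you build the table with array.NewTableFromRecords, with github.com/apache/arrow/go/arrow/array imported):

```go
	// stash records to build a table: Retain what we keep ...
	var recs []array.Record
	for r.Next() {
		rec := r.Record()
		rec.Retain() // the reader releases the record on the next call to Next
		recs = append(recs, rec)
	}

	// NewTableFromRecords takes its own reference on the underlying data
	// (iirc), so we can drop ours afterwards.
	tbl := array.NewTableFromRecords(schema, recs)
	defer tbl.Release()

	// ... and Release our references once the table holds its own.
	for _, rec := range recs {
		rec.Release()
	}
```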
here is my current attempt:
finally, as I was alluding to on the #data-science slack channel, right now
the Go arrow/csv reader will create a new Record for each row of the incoming
CSV file, so you get a bunch of overhead for every row/record.
a much more efficient way would be to chunk `n` rows into a single Record.
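something like this, e.g. with the stdlib encoding/csv reader feeding an array.RecordBuilder (again just a sketch, with a placeholder file name, schema and chunk size):

```go
package main

import (
	"encoding/csv"
	"io"
	"log"
	"os"
	"strconv"

	"github.com/apache/arrow/go/arrow"
	"github.com/apache/arrow/go/arrow/array"
	"github.com/apache/arrow/go/arrow/memory"
)

func main() {
	f, err := os.Open("records.csv") // placeholder file name
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// placeholder schema: adapt to your actual columns.
	schema := arrow.NewSchema([]arrow.Field{
		{Name: "id", Type: arrow.PrimitiveTypes.Int64},
		{Name: "name", Type: arrow.BinaryTypes.String},
	}, nil)

	bldr := array.NewRecordBuilder(memory.NewGoAllocator(), schema)
	defer bldr.Release()

	const chunk = 64 * 1024 // rows per Record, tune as needed
	var recs []array.Record

	cr := csv.NewReader(f)
	n := 0
	for {
		row, err := cr.Read()
		if err == io.EOF {
			break
		}
		if err != nil {
			log.Fatal(err)
		}
		id, err := strconv.ParseInt(row[0], 10, 64)
		if err != nil {
			log.Fatal(err)
		}
		bldr.Field(0).(*array.Int64Builder).Append(id)
		bldr.Field(1).(*array.StringBuilder).Append(row[1])
		if n++; n == chunk {
			recs = append(recs, bldr.NewRecord()) // one Record per chunk of rows
			n = 0
		}
	}
	if n > 0 {
		recs = append(recs, bldr.NewRecord()) // last partial chunk
	}

	// ... build the table from recs, then Release the records and the table.
	_ = recs
}
```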
an even more efficient way would be to create a dedicated csv.table type
that implements array.Table (as it seems you're interested in using that
interface) but only reads the incoming CSV file piecewise (ie: implementing
the chunking I was alluding to above but w/o having to load the whole file
in memory).
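for reference, that type would have to implement something like this (quoting the array.Table interface from memory; check go/arrow/array for the exact definition):

```go
// a csv-backed table would implement these methods, pulling chunks of
// rows from the underlying file on demand instead of holding them all.
type Table interface {
	Schema() *arrow.Schema
	NumRows() int64
	NumCols() int64
	Column(i int) *array.Column

	Retain()
	Release()
}
```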
as a first step to improve this issue, implementing chunking would already
shave off a bunch of overhead.