git.net

Re: [Go] High memory usage on CSV read into table


hi Daniel,
On Sun, Nov 18, 2018 at 10:17 PM Daniel Harper <djharperuk@xxxxxxxxx> wrote:

> Sorry just realised SVG doesn't work.
>
> PNG of the pprof can be found here: https://i.imgur.com/BVXv1Jm.png
>
>
> Daniel Harper
> http://djhworld.github.io
>
>
> On Sun, 18 Nov 2018 at 21:07, Daniel Harper <djharperuk@xxxxxxxxx> wrote:
>
> > Wasn't sure where the best place to discuss this, but I've noticed that
> > when running the following piece of code
> >
> > https://play.golang.org/p/SKkqPWoHPPS
> >
> > On a CSV file that contains roughly 1 million records (about 100 MB of
> > data), the memory usage of the process leaps to about 9.1 GB
> >
> > The records look something like this
> >
> >
> > "2018-08-27T20:00:00Z","cdnA","dash","audio","http","programme-1","3577","2018","08","27","2018-08-27","live"
> >
> > "2018-08-27T20:00:01Z","cdnB","hls","video","https","programme-2","14","2018","08","27","2018-08-27","ondemand"
> >
> > I've attached a pprof output of the process.
> >
> > From the looks of it the heavy use of _strings_ might be where most of the
> > memory is going.
> >
> > Is this expected? I'm new to the code, happy to help where possible!
>

it's somewhat expected.

you use `ioutil.ReadFile` to get your data.
this reads the whole file into memory and keeps it there: so there's that.
for much bigger files, I would recommend using `os.Open` and reading from
the file as a stream instead.

also, you don't release the individual records once they have been passed to
the table (Records are reference-counted and need a `Release` call), so you
have a memory leak.
here is my current attempt:
- https://play.golang.org/p/ns3GJW6Wx3T

finally, as I was alluding to on the #data-science slack channel, right now
Go arrow/csv will create a new Record for each row in the incoming CSV file.
so you get a bunch of overhead for every row/record.

a much more efficient way would be to chunk `n` rows into a single Record.
an even more efficient way would be to create a dedicated csv.table type
that implements array.Table (as it seems you're interested in using that
interface) but only reads the incoming CSV file piecewise (ie: implementing
the chunking I was alluding to above but w/o having to load the whole
[]Record slice.)

as a first step to improve this issue, implementing chunking would already
shave off a bunch of overhead.

-s