Subject: RE: Performance issues of prepending a table

Hi Ian,

Thank you very much, that pretty much answers it.

Best regards,
Andre Medeiros
From: Ian Varley [[email protected]]
Sent: Wednesday, April 18, 2012 17:11
To: [email protected]
Subject: Re: Performance issues of prepending a table

I would guess that this approach would be susceptible to the same kind of "hot
spotting" as inserting sequential keys; if you're prepending globally (i.e.
there's one global "first" row), then all activity will be taking place on the
same region server, so you wouldn't be taking advantage of the natural
parallelism of a clustered system like HBase.
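
To make the hot-spotting point concrete, here is a minimal sketch in plain Java. It does not use the HBase client; the four region start keys (0, 1000, 2000, 3000) and the helper names are assumptions made up for illustration. It counts how many writes each region would receive for a prepend-style workload versus keys scattered across the key space.

```java
import java.util.TreeMap;

// Illustration only: an in-memory model of "which region serves this key".
class HotSpotSketch {
    // Hypothetical region start keys mapped to a region index.
    static final TreeMap<Long, Integer> REGIONS = new TreeMap<>();
    static {
        REGIONS.put(0L, 0);
        REGIONS.put(1000L, 1);
        REGIONS.put(2000L, 2);
        REGIONS.put(3000L, 3);
    }

    // Counts writes per region; a key belongs to the region with the
    // greatest start key that is <= the key.
    static int[] countWrites(long[] keys) {
        int[] counts = new int[REGIONS.size()];
        for (long key : keys) {
            counts[REGIONS.floorEntry(key).getValue()]++;
        }
        return counts;
    }

    // Prepend workload: n keys, each one below the previous first key.
    static long[] prependKeys(int n, long firstKey) {
        long[] keys = new long[n];
        for (int i = 0; i < n; i++) {
            keys[i] = firstKey - 1 - i;
        }
        return keys;
    }

    // Scattered workload: n keys spread deterministically over [0, 4000).
    static long[] scatteredKeys(int n) {
        long[] keys = new long[n];
        for (int i = 0; i < n; i++) {
            keys[i] = (i * 1237L) % 4000L;
        }
        return keys;
    }
}
```

With these assumed boundaries, `countWrites(prependKeys(100, 1000))` puts every write in the first region, while the scattered keys reach all four.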

That aside, I can't think of anything architectural about HBase that would
make it perform poorly when continually inserting rows that sort before
other rows. I think the log-structured merge trees that HBase uses for storage
will handle any kind of insert activity more or less identically, and write to
the WAL and the memstore with equal speed regardless of row key position.
Flushes to storefiles on disk follow the sorted arrangement in memory,
which has already taken place by that point. There may be some lower-order
differences in the speed of inserting into the memstore, depending on position,
but that'd be something you'd have to benchmark, and my guess is you'd get
nothing discernible. But as always, the best way to know is to try it. :)
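
A rough model of that write path, as a sketch only: the class and method names below are hypothetical, a list stands in for the WAL and a sorted map stands in for the memstore. It is not HBase code; it just shows that the log records arrival order while the flush always comes out sorted, whatever order the keys were inserted in.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentSkipListMap;

// Illustration only: a toy log-structured write path.
class WritePathSketch {
    // Append-only log: keeps keys in arrival order.
    final List<Long> wal = new ArrayList<>();
    // In-memory sorted buffer: keeps entries ordered by key.
    final ConcurrentSkipListMap<Long, String> memstore = new ConcurrentSkipListMap<>();

    // Every write is appended to the log and inserted into the sorted buffer.
    void put(long key, String value) {
        wal.add(key);
        memstore.put(key, value);
    }

    // A flush walks the buffer in key order and then empties it.
    // Returns the keys in the order they would be written out.
    List<Long> flush() {
        List<Long> storeFile = new ArrayList<>(memstore.keySet());
        memstore.clear();
        return storeFile;
    }
}
```

Calling `put` with keys 3, 2, 1 (a prepend pattern) leaves the log as 3, 2, 1 and flushes as 1, 2, 3, the same flush result as inserting 1, 2, 3 in ascending order.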


On Apr 18, 2012, at 8:59 AM, de Souza Medeiros Andre wrote:

Hi all,

For some specific reason, I have an HBase table that needs to be prepended to
frequently. The row keys in this table are long integers (converted to bytes of
course). "Prepend" is an operation that does the following:
1. Scans the table just for the purpose of getting the row key X of the first
row, then stops the scan.
2. CheckAndSet on X-1, checking if row X-1 is null and putting data at row key
X-1 if so.
3. If the CAS failed, try CAS on X-2, etc.
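
The three steps above can be sketched as follows. This is an in-memory stand-in, not the HBase client API: a sorted map plays the table, `putIfAbsent` plays the check-and-set, and the class name, method names, and `START_KEY` seed for an empty table are all assumptions made for illustration. Key underflow is not handled.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;

// Illustration only: the prepend loop against a sorted in-memory map.
class PrependSketch {
    // Hypothetical key to use when the table is still empty.
    static final long START_KEY = Long.MAX_VALUE;

    private final ConcurrentSkipListMap<Long, String> table = new ConcurrentSkipListMap<>();

    // Step 1: read the first row key X, then insert below it.
    long prepend(String value) {
        Map.Entry<Long, String> first = table.firstEntry();
        if (first == null) {
            // Empty table: try the seed key itself, falling back to lower keys.
            return insertAtOrBelow(START_KEY, value);
        }
        return prependBelow(first.getKey(), value);
    }

    // Steps 2 and 3: try X-1, then X-2, and so on until a slot is free.
    long prependBelow(long x, String value) {
        return insertAtOrBelow(x - 1, value);
    }

    // putIfAbsent returns null only when the key was empty and the put succeeded.
    private long insertAtOrBelow(long candidate, String value) {
        while (table.putIfAbsent(candidate, value) != null) {
            candidate--; // the "CAS" failed: another writer took this key
        }
        return candidate;
    }

    long firstKey() {
        return table.firstKey();
    }
}
```

`prependBelow` is exposed separately so the retry path can be exercised: calling it with a stale X (one that another writer has already prepended below) ends up at X-2 instead of X-1.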

I'd like to know if there are any obvious performance drawbacks with this
approach, compared to inserting rows randomly in the table. By "obvious
performance drawbacks" I mean something that doesn't need to be benchmarked to
know its effects.

I am aware that scanning plus CAS will be slower than a simple Put, but I'd
like to know if prepending has any negative effect regarding region management
and misc.

Thank you,
Andre Medeiros
