Re: CASSANDRA-13241 lower default chunk_length_in_kb


FWIW, as much as I agree this is a change worth making in general, I am
-0 for 4.0, on both the "compact sequencing" and the change of default.
We're closing in on 2 months into the freeze, and for me a freeze does
include not changing defaults, because changing a default ideally implies a
decent amount of analysis/benchmarking of the consequences of that change[1],
and that doesn't fit my definition of a freeze.

[1]: to be extra clear, I'm not saying we've always done this, far from it.
But I hope we can all agree we were wrong not to do it when we didn't and
should strive to improve, not repeat past mistakes.
--
Sylvain


On Thu, Oct 18, 2018 at 8:55 PM Ariel Weisberg <ariel@xxxxxxxxxxx> wrote:

> Hi,
>
> For those who were asking about the performance impact of block size on
> compression I wrote a microbenchmark.
>
> https://pastebin.com/RHDNLGdC
>
>      [java] Benchmark                                               Mode  Cnt          Score          Error  Units
>      [java] CompactIntegerSequenceBench.benchCompressLZ4Fast16k    thrpt   15  331190055.685 ±  8079758.044  ops/s
>      [java] CompactIntegerSequenceBench.benchCompressLZ4Fast32k    thrpt   15  353024925.655 ±  7980400.003  ops/s
>      [java] CompactIntegerSequenceBench.benchCompressLZ4Fast64k    thrpt   15  365664477.654 ± 10083336.038  ops/s
>      [java] CompactIntegerSequenceBench.benchCompressLZ4Fast8k     thrpt   15  305518114.172 ± 11043705.883  ops/s
>      [java] CompactIntegerSequenceBench.benchDecompressLZ4Fast16k  thrpt   15  688369529.911 ± 25620873.933  ops/s
>      [java] CompactIntegerSequenceBench.benchDecompressLZ4Fast32k  thrpt   15  703635848.895 ±  5296941.704  ops/s
>      [java] CompactIntegerSequenceBench.benchDecompressLZ4Fast64k  thrpt   15  695537044.676 ± 17400763.731  ops/s
>      [java] CompactIntegerSequenceBench.benchDecompressLZ4Fast8k   thrpt   15  727725713.128 ±  4252436.331  ops/s
>
> To summarize, compression is 8.5% slower and decompression is 1% faster.
> This is measuring the impact on compression/decompression not the huge
> impact that would occur if we decompressed data we don't need less often.
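
For anyone who wants to recompute percentages from the table, here is a
minimal Python sketch. The throughput numbers are copied from the LZ4 fast
rows quoted above. The summary does not say which rows are being compared, so
the sketch assumes the current 64k default as the baseline; that baseline and
the function name are assumptions, and the output need not match the rounded
figures in the summary exactly.

```python
# Throughput (ops/s) copied from the LZ4 fast rows of the benchmark table.
LZ4_FAST_OPS = {
    "compress": {8: 305518114.172, 16: 331190055.685,
                 32: 353024925.655, 64: 365664477.654},
    "decompress": {8: 727725713.128, 16: 688369529.911,
                   32: 703635848.895, 64: 695537044.676},
}


def relative_change(ops_by_kb, chunk_kb, baseline_kb=64):
    """Percent change in throughput of chunk_kb relative to baseline_kb.

    A negative value means chunk_kb has lower throughput than the baseline.
    The 64k baseline is an assumption (it is the default being replaced).
    """
    return (ops_by_kb[chunk_kb] / ops_by_kb[baseline_kb] - 1.0) * 100.0


if __name__ == "__main__":
    for op, ops_by_kb in LZ4_FAST_OPS.items():
        for kb in (8, 16, 32):
            change = relative_change(ops_by_kb, kb)
            print(f"{op:>10} {kb:>2}k vs 64k: {change:+.1f}%")
```

This only compares raw compress/decompress throughput per chunk size. It says
nothing about the larger saving from not decompressing data that is never
read.
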
>
> I didn't test decompression of Snappy and LZ4 high, but I did test
> compression.
>
> Snappy:
>      [java] CompactIntegerSequenceBench.benchCompressSnappy16k   thrpt    2  196574766.116          ops/s
>      [java] CompactIntegerSequenceBench.benchCompressSnappy32k   thrpt    2  198538643.844          ops/s
>      [java] CompactIntegerSequenceBench.benchCompressSnappy64k   thrpt    2  194600497.613          ops/s
>      [java] CompactIntegerSequenceBench.benchCompressSnappy8k    thrpt    2  186040175.059          ops/s
>
> LZ4 high compressor:
>      [java] CompactIntegerSequenceBench.bench16k  thrpt    2  20822947.578          ops/s
>      [java] CompactIntegerSequenceBench.bench32k  thrpt    2  12037342.253          ops/s
>      [java] CompactIntegerSequenceBench.bench64k  thrpt    2   6782534.469          ops/s
>      [java] CompactIntegerSequenceBench.bench8k   thrpt    2  32254619.594          ops/s
>
> LZ4 high is the one instance where block size mattered a lot. It's a bit
> suspicious really when you look at the ratio of performance to block size
> being close to 1:1. I couldn't spot a bug in the benchmark though.
>
> Compression ratios with LZ4 fast for the text of Alice in Wonderland were:
>
> Chunk size 8192, ratio 0.709473
> Chunk size 16384, ratio 0.667236
> Chunk size 32768, ratio 0.634735
> Chunk size 65536, ratio 0.607208
>
> By way of comparison I also ran deflate with maximum compression:
>
> Chunk size 8192, ratio 0.426434
> Chunk size 16384, ratio 0.402423
> Chunk size 32768, ratio 0.381627
> Chunk size 65536, ratio 0.364865
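
The shape of these numbers (smaller chunks compress worse) can be reproduced
with a short sketch. This is not the benchmark from the pastebin: it uses
Python's standard zlib module (deflate at level 9) instead of the Java
compressors, the function name is made up for illustration, and the sample
input is a stand-in rather than the Alice in Wonderland text. The ratio is
compressed size divided by uncompressed size, so lower is better, which
matches the direction of the figures above.

```python
import zlib


def chunked_deflate_ratio(data: bytes, chunk_size: int, level: int = 9) -> float:
    """Compress data in independent chunks; return compressed/uncompressed size."""
    if not data:
        raise ValueError("data must not be empty")
    compressed = 0
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        # Each chunk is compressed on its own, so it cannot reuse matches
        # from earlier chunks and pays its own per-stream overhead.
        compressed += len(zlib.compress(chunk, level))
    return compressed / len(data)


if __name__ == "__main__":
    # Stand-in input; any file read as bytes could be substituted here.
    sample = b"The quick brown fox jumps over the lazy dog. " * 5000
    for size in (8192, 16384, 32768, 65536):
        ratio = chunked_deflate_ratio(sample, size)
        print(f"Chunk size {size}, ratio {ratio:.6f}")
```

The absolute values depend entirely on the input and on the compressor, so
only the trend across chunk sizes is comparable with the numbers in the email.
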
>
> Ariel
>
> On Thu, Oct 18, 2018, at 5:32 AM, Benedict Elliott Smith wrote:
> > FWIW, I’m not -0, just think that long after the freeze date a change
> > like this needs a strong mandate from the community.  I think the change
> > is a good one.
> >
> >
> >
> >
> >
> > > On 17 Oct 2018, at 22:09, Ariel Weisberg <ariel@xxxxxxxxxxx> wrote:
> > >
> > > Hi,
> > >
> > > It's really not appreciably slower compared to the decompression we
> > > are going to do which is going to take several microseconds.
> > > Decompression is also going to be faster because we are going to do less
> > > unnecessary decompression and the decompression itself may be faster
> > > since it may fit in a higher level cache better. I ran a microbenchmark
> > > comparing them.
> > >
> > >
> > > https://issues.apache.org/jira/browse/CASSANDRA-13241?focusedCommentId=16653988&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16653988
> > >
> > > Fetching a long from memory:       56 nanoseconds
> > > Compact integer sequence   :       80 nanoseconds
> > > Summing integer sequence   :      165 nanoseconds
> > >
> > > Currently we have one +1 from Kurt to change the representation and
> > > possibly a -0 from Benedict. That's not really enough to make an
> > > exception to the code freeze. If you want it to happen (or not) you need
> > > to speak up otherwise only the default will change.
> > >
> > > Regards,
> > > Ariel
> > >
> > > On Wed, Oct 17, 2018, at 6:40 AM, kurt greaves wrote:
> > >> I think if we're going to drop it to 16k, we should invest in the
> > >> compact sequencing as well. Just lowering it to 16k will have
> > >> potentially a painful impact on anyone running low memory nodes, but if
> > >> we can do it without the memory impact I don't think there's any reason
> > >> to wait another major version to implement it.
> > >>
> > >> Having said that, we should probably benchmark the two representations
> > >> Ariel has come up with.
> > >>
> > >> On Wed, 17 Oct 2018 at 20:17, Alain RODRIGUEZ <arodrime@xxxxxxxxx> wrote:
> > >>
> > >>> +1
> > >>>
> > >>> I would guess a lot of C* clusters/tables have this option set to the
> > >>> default value, and not many of them need to read such big chunks of
> > >>> data.
> > >>> I believe this will greatly limit disk overreads for a fair amount (a
> > >>> big majority?) of new users. It seems fair enough to change this
> > >>> default value, and I also think 4.0 is a nice place to do this.
> > >>>
> > >>> Thanks for taking care of this Ariel and for making sure there is a
> > >>> consensus here as well,
> > >>>
> > >>> C*heers,
> > >>> -----------------------
> > >>> Alain Rodriguez - alain@xxxxxxxxxxxxxxxxx
> > >>> France / Spain
> > >>>
> > >>> The Last Pickle - Apache Cassandra Consulting
> > >>> http://www.thelastpickle.com
> > >>>
> > >>> On Sat, Oct 13, 2018 at 08:52, Ariel Weisberg <ariel@xxxxxxxxxxx> wrote:
> > >>>
> > >>>> Hi,
> > >>>>
> > >>>> This would only impact new tables, existing tables would get their
> > >>>> chunk_length_in_kb from the existing schema. It's something we record
> > >>>> in a system table.
> > >>>>
> > >>>> I have an implementation of a compact integer sequence that only
> > >>>> requires 37% of the memory required today. So we could do this with
> > >>>> only slightly more than doubling the memory used. I'll post that to the
> > >>>> JIRA soon.
> > >>>>
> > >>>> Ariel
> > >>>>
> > >>>> On Fri, Oct 12, 2018, at 1:56 AM, Jeff Jirsa wrote:
> > >>>>>
> > >>>>>
> > >>>>> I think 16k is a better default, but it should only affect new tables.
> > >>>>> Whoever changes it, please make sure you think about the upgrade path.
> > >>>>>
> > >>>>>
> > >>>>>> On Oct 12, 2018, at 2:31 AM, Ben Bromhead <ben@xxxxxxxxxxxxxxx> wrote:
> > >>>>>>
> > >>>>>> This is something that's bugged me for ages. TBH the performance gain
> > >>>>>> for most use cases far outweighs the increase in memory usage, and I
> > >>>>>> would even be in favor of changing the default now and optimizing the
> > >>>>>> storage cost later (if it's found to be worth it).
> > >>>>>>
> > >>>>>> For some anecdotal evidence:
> > >>>>>> 4kb is usually what we end up setting it to. 16kb feels more
> > >>>>>> reasonable given the memory impact, but what would be the point if,
> > >>>>>> practically, most folks set it to 4kb anyway?
> > >>>>>>
> > >>>>>> Note that chunk_length will largely be dependent on your read sizes,
> > >>>>>> but 4k is the floor for most physical devices in terms of their block
> > >>>>>> size.
> > >>>>>>
> > >>>>>> +1 for making this change in 4.0 given the small size and the large
> > >>>>>> improvement to new users' experience (as long as we are explicit in
> > >>>>>> the documentation about memory consumption).
> > >>>>>>
> > >>>>>>
> > >>>>>>> On Thu, Oct 11, 2018 at 7:11 PM Ariel Weisberg <ariel@xxxxxxxxxxx> wrote:
> > >>>>>>>
> > >>>>>>> Hi,
> > >>>>>>>
> > >>>>>>> This is regarding https://issues.apache.org/jira/browse/CASSANDRA-13241
> > >>>>>>>
> > >>>>>>> This ticket has languished for a while. IMO it's too late in 4.0 to
> > >>>>>>> implement a more memory efficient representation for compressed chunk
> > >>>>>>> offsets. However I don't think we should put out another release with
> > >>>>>>> the current 64k default as it's pretty unreasonable.
> > >>>>>>>
> > >>>>>>> I propose that we lower the value to 16kb. 4k might never be the
> > >>>>>>> correct default anyways as there is a cost to compression and 16k will
> > >>>>>>> still be a large improvement.
> > >>>>>>>
> > >>>>>>> Benedict and Jon Haddad are both +1 on making this change for 4.0. In
> > >>>>>>> the past there has been some consensus about reducing this value
> > >>>>>>> although maybe with more memory efficiency.
> > >>>>>>>
> > >>>>>>> The napkin math for what this costs is:
> > >>>>>>> "If you have 1TB of uncompressed data, with 64k chunks that's 16M
> > >>>>>>> chunks at 8 bytes each (128MB).
> > >>>>>>> With 16k chunks, that's 512MB.
> > >>>>>>> With 4k chunks, it's 2G.
> > >>>>>>> Per terabyte of data (pre-compression)."
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> https://issues.apache.org/jira/browse/CASSANDRA-13241?focusedCommentId=15886621&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15886621
> > >>>>>>>
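
The quoted napkin math is easy to check with a few lines of Python. This is
only a sketch of the arithmetic in the quote: one 8-byte offset per chunk,
with binary units assumed (1TB taken as 2^40 bytes, 64k as 65536 bytes). The
function and constant names are made up for illustration and are not taken
from the Cassandra code base.

```python
OFFSET_BYTES = 8  # one 8-byte offset per compressed chunk, as in the quote


def chunk_offset_bytes(uncompressed_bytes: int, chunk_length_kb: int) -> int:
    """Memory needed to hold one offset per chunk of the given length."""
    chunk_bytes = chunk_length_kb * 1024
    # Ceiling division: a partial trailing chunk still needs an offset.
    chunks = -(-uncompressed_bytes // chunk_bytes)
    return chunks * OFFSET_BYTES


if __name__ == "__main__":
    one_tib = 1024 ** 4
    for kb in (64, 16, 4):
        mib = chunk_offset_bytes(one_tib, kb) / 1024 ** 2
        print(f"{kb:>2}k chunks: {mib:.0f} MiB of offsets per TiB")
```

Under those assumptions the results are 128 MiB, 512 MiB and 2048 MiB per
terabyte of uncompressed data, in line with the figures in the quote.
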
> > >>>>>>> By way of comparison memory mapping the files has a similar cost per
> > >>>>>>> 4k page of 8 bytes. Multiple mappings makes this more expensive. With
> > >>>>>>> a default of 16kb this would be 4x less expensive than memory mapping
> > >>>>>>> a file. I only mention this to give a sense of the costs we are
> > >>>>>>> already paying. I am not saying they are directly related.
> > >>>>>>>
> > >>>>>>> I'll wait a week for discussion and if there is consensus make the
> > >>>>>>> change.
> > >>>>>>>
> > >>>>>>> Regards,
> > >>>>>>> Ariel
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> ---------------------------------------------------------------------
> > >>>>>>> To unsubscribe, e-mail: dev-unsubscribe@xxxxxxxxxxxxxxxxxxxx
> > >>>>>>> For additional commands, e-mail: dev-help@xxxxxxxxxxxxxxxxxxxx
> > >>>>>>>
> > >>>>>>> --
> > >>>>>> Ben Bromhead
> > >>>>>> CTO | Instaclustr <https://www.instaclustr.com/>
> > >>>>>> +1 650 284 9692
> > >>>>>> Reliability at Scale
> > >>>>>> Cassandra, Spark, Elasticsearch on AWS, Azure, GCP and Softlayer