Re: CASSANDRA-13241 lower default chunk_length_in_kb


Agree with Sylvain (and I think Benedict) - there’s no compelling reason to violate the freeze here. We’ve had the wrong default for years - add a note to the docs that we’ll be changing it in the future, but let’s not violate the freeze now.

-- 
Jeff Jirsa


> On Oct 19, 2018, at 10:06 AM, Sylvain Lebresne <lebresne@xxxxxxxxx> wrote:
> 
> Fwiw, as much as I agree this is a change worth doing in general, I am
> -0 for 4.0. Both for the "compact sequencing" and for the change of default, really.
> We're closing in on 2 months into the freeze, and for me a freeze does include
> not changing defaults, because changing a default ideally implies a decent
> amount of analysis/benchmarking of the consequences of that change[1], and that
> doesn't fit my definition of a freeze.
> 
> [1]: to be extra clear, I'm not saying we've always done this, far from it.
> But I hope we can all agree we were wrong not to do it when we didn't, and
> should strive to improve, not repeat past mistakes.
> --
> Sylvain
> 
> 
>> On Thu, Oct 18, 2018 at 8:55 PM Ariel Weisberg <ariel@xxxxxxxxxxx> wrote:
>> 
>> Hi,
>> 
>> For those who were asking about the performance impact of block size on
>> compression I wrote a microbenchmark.
>> 
>> https://pastebin.com/RHDNLGdC
>> 
>>     [java] Benchmark                                               Mode  Cnt          Score          Error  Units
>>     [java] CompactIntegerSequenceBench.benchCompressLZ4Fast16k    thrpt   15  331190055.685 ±  8079758.044  ops/s
>>     [java] CompactIntegerSequenceBench.benchCompressLZ4Fast32k    thrpt   15  353024925.655 ±  7980400.003  ops/s
>>     [java] CompactIntegerSequenceBench.benchCompressLZ4Fast64k    thrpt   15  365664477.654 ± 10083336.038  ops/s
>>     [java] CompactIntegerSequenceBench.benchCompressLZ4Fast8k     thrpt   15  305518114.172 ± 11043705.883  ops/s
>>     [java] CompactIntegerSequenceBench.benchDecompressLZ4Fast16k  thrpt   15  688369529.911 ± 25620873.933  ops/s
>>     [java] CompactIntegerSequenceBench.benchDecompressLZ4Fast32k  thrpt   15  703635848.895 ±  5296941.704  ops/s
>>     [java] CompactIntegerSequenceBench.benchDecompressLZ4Fast64k  thrpt   15  695537044.676 ± 17400763.731  ops/s
>>     [java] CompactIntegerSequenceBench.benchDecompressLZ4Fast8k   thrpt   15  727725713.128 ±  4252436.331  ops/s
>> 
>> To summarize, compression is 8.5% slower and decompression is 1% faster.
>> This measures only the cost of compression/decompression itself, not the
>> much larger benefit of less often decompressing data we don't need.
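
For readers who want to recompute relative differences from the LZ4 fast scores posted above, here is a minimal Python sketch. It assumes the current 64k default as the baseline, which the summary does not state, so its output need not match the summary figures. The names `LZ4_FAST_OPS` and `relative_change` are illustrative only and do not come from the benchmark.

```python
# Scores (ops/s) copied from the LZ4 fast results posted above,
# keyed by chunk size in KB.
LZ4_FAST_OPS = {
    "compress": {
        8: 305518114.172,
        16: 331190055.685,
        32: 353024925.655,
        64: 365664477.654,
    },
    "decompress": {
        8: 727725713.128,
        16: 688369529.911,
        32: 703635848.895,
        64: 695537044.676,
    },
}


def relative_change(ops, chunk_kb, baseline_kb=64):
    """Percent change in throughput at chunk_kb relative to baseline_kb.

    Negative means slower than the baseline, positive means faster.
    The 64k baseline is an assumption, not something the thread states.
    """
    return (ops[chunk_kb] / ops[baseline_kb] - 1.0) * 100.0


if __name__ == "__main__":
    for mode, ops in LZ4_FAST_OPS.items():
        for kb in (8, 16, 32):
            change = relative_change(ops, kb)
            print(f"{mode:>10} {kb:>2}k vs 64k: {change:+.1f}%")
```

The reported error bars are several percent of the score in some rows, so differences of a percent or two are within the noise of this benchmark.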
>> 
>> I didn't test decompression of Snappy and LZ4 high, but I did test
>> compression.
>> 
>> Snappy:
>>     [java] CompactIntegerSequenceBench.benchCompressSnappy16k   thrpt    2  196574766.116          ops/s
>>     [java] CompactIntegerSequenceBench.benchCompressSnappy32k   thrpt    2  198538643.844          ops/s
>>     [java] CompactIntegerSequenceBench.benchCompressSnappy64k   thrpt    2  194600497.613          ops/s
>>     [java] CompactIntegerSequenceBench.benchCompressSnappy8k    thrpt    2  186040175.059          ops/s
>> 
>> LZ4 high compressor:
>>     [java] CompactIntegerSequenceBench.bench16k  thrpt    2  20822947.578          ops/s
>>     [java] CompactIntegerSequenceBench.bench32k  thrpt    2  12037342.253          ops/s
>>     [java] CompactIntegerSequenceBench.bench64k  thrpt    2   6782534.469          ops/s
>>     [java] CompactIntegerSequenceBench.bench8k   thrpt    2  32254619.594          ops/s
>> 
>> LZ4 high is the one instance where block size mattered a lot. It's a bit
>> suspicious really when you look at the ratio of performance to block size
>> being close to 1:1. I couldn't spot a bug in the benchmark though.
>> 
>> Compression ratios with LZ4 fast for the text of Alice in Wonderland were:
>> 
>> Chunk size 8192, ratio 0.709473
>> Chunk size 16384, ratio 0.667236
>> Chunk size 32768, ratio 0.634735
>> Chunk size 65536, ratio 0.607208
>> 
>> By way of comparison I also ran deflate with maximum compression:
>> 
>> Chunk size 8192, ratio 0.426434
>> Chunk size 16384, ratio 0.402423
>> Chunk size 32768, ratio 0.381627
>> Chunk size 65536, ratio 0.364865
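
A rough way to reproduce this kind of measurement is to compress each chunk independently and divide the total compressed size by the original size. The Python sketch below does that with zlib-wrapped deflate at level 9, which is an assumption and may not be exactly what was run for the numbers above. The input is a synthetic stand-in corpus, not the text of Alice in Wonderland, and the helper names are illustrative.

```python
import random
import zlib


def chunked_ratio(data, chunk_size, level=9):
    """Return compressed size / original size, compressing each chunk on its own.

    Lower is better, matching the ratios quoted in the thread.
    """
    if not data:
        raise ValueError("data must be non-empty")
    compressed = 0
    for start in range(0, len(data), chunk_size):
        compressed += len(zlib.compress(data[start:start + chunk_size], level))
    return compressed / len(data)


def sample_text(n_bytes, seed=0):
    """Deterministic pseudo-text, used only as a placeholder corpus."""
    rng = random.Random(seed)
    words = [b"alice", b"rabbit", b"queen", b"hatter",
             b"tea", b"garden", b"cat", b"key"]
    out = bytearray()
    while len(out) < n_bytes:
        out += rng.choice(words) + b" "
    return bytes(out[:n_bytes])


if __name__ == "__main__":
    data = sample_text(256 * 1024)
    for size in (8192, 16384, 32768, 65536):
        print(f"Chunk size {size}, ratio {chunked_ratio(data, size):.6f}")
```

To run it against real data, replace `sample_text(...)` with the bytes of any file; the absolute ratios will depend heavily on the input.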
>> 
>> Ariel
>> 
>>> On Thu, Oct 18, 2018, at 5:32 AM, Benedict Elliott Smith wrote:
>>> FWIW, I’m not -0, just think that long after the freeze date a change
>>> like this needs a strong mandate from the community.  I think the change
>>> is a good one.
>>> 
>>> 
>>> 
>>> 
>>> 
>>>> On 17 Oct 2018, at 22:09, Ariel Weisberg <ariel@xxxxxxxxxxx> wrote:
>>>> 
>>>> Hi,
>>>> 
>>>> It's really not appreciably slower compared to the decompression we
>>>> are going to do which is going to take several microseconds.
>>>> Decompression is also going to be faster because we are going to do
>>>> less unnecessary decompression and the decompression itself may be
>>>> faster since it may fit in a higher level cache better. I ran a
>>>> microbenchmark comparing them.
>>>>
>>>> https://issues.apache.org/jira/browse/CASSANDRA-13241?focusedCommentId=16653988&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16653988
>>>> 
>>>> Fetching a long from memory:       56 nanoseconds
>>>> Compact integer sequence   :       80 nanoseconds
>>>> Summing integer sequence   :      165 nanoseconds
>>>> 
>>>> Currently we have one +1 from Kurt to change the representation and
>>>> possibly a -0 from Benedict. That's not really enough to make an
>>>> exception to the code freeze. If you want it to happen (or not) you
>>>> need to speak up otherwise only the default will change.
>>>> 
>>>> Regards,
>>>> Ariel
>>>> 
>>>>> On Wed, Oct 17, 2018, at 6:40 AM, kurt greaves wrote:
>>>>> I think if we're going to drop it to 16k, we should invest in the compact
>>>>> sequencing as well. Just lowering it to 16k will have potentially a painful
>>>>> impact on anyone running low memory nodes, but if we can do it without the
>>>>> memory impact I don't think there's any reason to wait another major
>>>>> version to implement it.
>>>>> 
>>>>> Having said that, we should probably benchmark the two representations
>>>>> Ariel has come up with.
>>>>> 
>>>>> On Wed, 17 Oct 2018 at 20:17, Alain RODRIGUEZ <arodrime@xxxxxxxxx> wrote:
>>>>> 
>>>>>> +1
>>>>>> 
>>>>>> I would guess a lot of C* clusters/tables have this option set to the
>>>>>> default value, and not many of them need to read such big chunks of data.
>>>>>> I believe this will greatly limit disk overreads for a fair amount (a big
>>>>>> majority?) of new users. It seems fair enough to change this default value;
>>>>>> I also think 4.0 is a nice place to do this.
>>>>>> 
>>>>>> Thanks for taking care of this Ariel and for making sure there is a
>>>>>> consensus here as well,
>>>>>> 
>>>>>> C*heers,
>>>>>> -----------------------
>>>>>> Alain Rodriguez - alain@xxxxxxxxxxxxxxxxx
>>>>>> France / Spain
>>>>>> 
>>>>>> The Last Pickle - Apache Cassandra Consulting
>>>>>> http://www.thelastpickle.com
>>>>>> 
>>>>>> On Sat, Oct 13, 2018 at 08:52, Ariel Weisberg <ariel@xxxxxxxxxxx> wrote:
>>>>>> 
>>>>>>> Hi,
>>>>>>> 
>>>>>>> This would only impact new tables; existing tables would get their
>>>>>>> chunk_length_in_kb from the existing schema. It's something we record in a
>>>>>>> system table.
>>>>>>>
>>>>>>> I have an implementation of a compact integer sequence that only requires
>>>>>>> 37% of the memory required today. So we could do this with only slightly
>>>>>>> more than doubling the memory used. I'll post that to the JIRA soon.
>>>>>>> 
>>>>>>> Ariel
>>>>>>> 
>>>>>>>> On Fri, Oct 12, 2018, at 1:56 AM, Jeff Jirsa wrote:
>>>>>>>> 
>>>>>>>> 
>>>>>>>> I think 16k is a better default, but it should only affect new tables.
>>>>>>>> Whoever changes it, please make sure you think about the upgrade path.
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> On Oct 12, 2018, at 2:31 AM, Ben Bromhead <ben@xxxxxxxxxxxxxxx> wrote:
>>>>>>>>> 
>>>>>>>>> This is something that's bugged me for ages. Tbh, the performance gain for
>>>>>>>>> most use cases far outweighs the increase in memory usage, and I would even
>>>>>>>>> be in favor of changing the default now and optimizing the storage cost later
>>>>>>>>> (if it's found to be worth it).
>>>>>>>>>
>>>>>>>>> For some anecdotal evidence:
>>>>>>>>> 4kb is usually what we end up setting it to. 16kb feels more reasonable given
>>>>>>>>> the memory impact, but what would be the point if, practically, most folks
>>>>>>>>> set it to 4kb anyway?
>>>>>>>>>
>>>>>>>>> Note that chunk_length will largely be dependent on your read sizes, but 4k
>>>>>>>>> is the floor for most physical devices in terms of block size.
>>>>>>>>>
>>>>>>>>> +1 for making this change in 4.0 given the small size and the large
>>>>>>>>> improvement to new users' experience (as long as we are explicit in the
>>>>>>>>> documentation about memory consumption).
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> On Thu, Oct 11, 2018 at 7:11 PM Ariel Weisberg <ariel@xxxxxxxxxxx> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> This is regarding https://issues.apache.org/jira/browse/CASSANDRA-13241
>>>>>>>>>> 
>>>>>>>>>> This ticket has languished for a while. IMO it's too late in 4.0 to
>>>>>>>>>> implement a more memory efficient representation for compressed chunk
>>>>>>>>>> offsets. However I don't think we should put out another release with the
>>>>>>>>>> current 64k default as it's pretty unreasonable.
>>>>>>>>>>
>>>>>>>>>> I propose that we lower the value to 16kb. 4k might never be the correct
>>>>>>>>>> default anyways as there is a cost to compression and 16k will still be a
>>>>>>>>>> large improvement.
>>>>>>>>>>
>>>>>>>>>> Benedict and Jon Haddad are both +1 on making this change for 4.0. In the
>>>>>>>>>> past there has been some consensus about reducing this value although maybe
>>>>>>>>>> with more memory efficiency.
>>>>>>>>>>
>>>>>>>>>> The napkin math for what this costs is:
>>>>>>>>>> "If you have 1TB of uncompressed data, with 64k chunks that's 16M chunks
>>>>>>>>>> at 8 bytes each (128MB).
>>>>>>>>>> With 16k chunks, that's 512MB.
>>>>>>>>>> With 4k chunks, it's 2G.
>>>>>>>>>> Per terabyte of data (pre-compression)."
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> https://issues.apache.org/jira/browse/CASSANDRA-13241?focusedCommentId=15886621&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15886621
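
The quoted estimate can be checked with a few lines of Python. This sketch assumes binary units (1TB taken as 2^40 bytes, 64k as 65536 bytes) and one 8-byte offset per chunk, as the quote states. The names are illustrative and not taken from the Cassandra code base.

```python
KIB = 1024
MIB = 1024 ** 2
TIB = 1024 ** 4
OFFSET_BYTES = 8  # one 8-byte offset per compressed chunk, per the quoted estimate


def chunk_offset_bytes(uncompressed_bytes, chunk_kb):
    """Memory needed to keep one offset per chunk of uncompressed data."""
    chunk_bytes = chunk_kb * KIB
    # Ceiling division: a trailing partial chunk still needs an offset.
    chunks = -(-uncompressed_bytes // chunk_bytes)
    return chunks * OFFSET_BYTES


if __name__ == "__main__":
    for kb in (64, 16, 4):
        mib = chunk_offset_bytes(TIB, kb) / MIB
        print(f"{kb:>2}k chunks: {mib:.0f} MiB of offsets per TiB")
```

Under those assumptions the results are 128 MiB, 512 MiB and 2048 MiB per TiB for 64k, 16k and 4k chunks, in line with the figures in the quote.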
>>>>>>>>>> 
>>>>>>>>>> By way of comparison, memory mapping the files has a similar cost of 8
>>>>>>>>>> bytes per 4k page. Multiple mappings make this more expensive. With a
>>>>>>>>>> default of 16kb this would be 4x less expensive than memory mapping a file.
>>>>>>>>>> I only mention this to give a sense of the costs we are already paying. I
>>>>>>>>>> am not saying they are directly related.
>>>>>>>>>>
>>>>>>>>>> I'll wait a week for discussion and, if there is consensus, make the change.
>>>>>>>>>> 
>>>>>>>>>> Regards,
>>>>>>>>>> Ariel
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>>> To unsubscribe, e-mail: dev-unsubscribe@xxxxxxxxxxxxxxxxxxxx
>>>>>>>>>> For additional commands, e-mail: dev-help@xxxxxxxxxxxxxxxxxxxx
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Ben Bromhead
>>>>>>>>> CTO | Instaclustr <https://www.instaclustr.com/>
>>>>>>>>> +1 650 284 9692
>>>>>>>>> Reliability at Scale
>>>>>>>>> Cassandra, Spark, Elasticsearch on AWS, Azure, GCP and Softlayer
