
Re: CASSANDRA-13241 lower default chunk_length_in_kb


The change of a default property doesn’t seem to violate the freeze?  The predominant phrase used in that thread was 'feature freeze'.  A lot of people are now interpreting it more broadly, so perhaps we need to revisit, but that’s probably a separate discussion?

The current default is really bad for most users, so I’m +1 on changing it.  Especially as the last time this topic was raised was (iirc) around the 3.0 freeze; we decided not to change anything then for similar reasons, and haven't revisited it since.
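For context, chunk_length_in_kb is a per-table compression option, so whatever default ships, operators can already override it at CREATE/ALTER TABLE time; the default only matters for tables that don't set it explicitly. A sketch of the override (keyspace, table, and schema here are made up for illustration):

```sql
-- Hypothetical table; chunk lengths are in KiB.
CREATE TABLE ks.events (
    id uuid PRIMARY KEY,
    payload blob
) WITH compression = {
    'class': 'LZ4Compressor',
    'chunk_length_in_kb': 16
};
```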


> On 19 Oct 2018, at 09:25, Jeff Jirsa <jjirsa@xxxxxxxxx> wrote:
> 
> Agree with Sylvain (and I think Benedict) - there’s no compelling reason to violate the freeze here. We’ve had the wrong default for years - add a note to the docs that we’ll be changing it in the future, but let’s not violate the freeze now.
> 
> -- 
> Jeff Jirsa
> 
> 
>> On Oct 19, 2018, at 10:06 AM, Sylvain Lebresne <lebresne@xxxxxxxxx> wrote:
>> 
>> Fwiw, as much as I agree this is a change worth doing in general, I am
>> -0 for 4.0, both on the "compact sequencing" and on the change of default.
>> We're closing in on 2 months within the freeze, and for me a freeze does
>> include not changing defaults, because changing a default ideally implies
>> a decent amount of analysis/benchmarking of the consequences of that
>> change[1], and that doesn't fit my definition of a freeze.
>> 
>> [1]: to be extra clear, I'm not saying we've always done this, far from it.
>> But I hope we can all agree we were wrong not to do it when we didn't, and
>> should strive to improve, not repeat past mistakes.
>> --
>> Sylvain
>> 
>> 
>>> On Thu, Oct 18, 2018 at 8:55 PM Ariel Weisberg <ariel@xxxxxxxxxxx> wrote:
>>> 
>>> Hi,
>>> 
>>> For those who were asking about the performance impact of block size on
>>> compression I wrote a microbenchmark.
>>> 
>>> https://pastebin.com/RHDNLGdC
>>> 
>>>    [java] Benchmark                                               Mode  Cnt          Score          Error  Units
>>>    [java] CompactIntegerSequenceBench.benchCompressLZ4Fast8k     thrpt   15  305518114.172 ± 11043705.883  ops/s
>>>    [java] CompactIntegerSequenceBench.benchCompressLZ4Fast16k    thrpt   15  331190055.685 ±  8079758.044  ops/s
>>>    [java] CompactIntegerSequenceBench.benchCompressLZ4Fast32k    thrpt   15  353024925.655 ±  7980400.003  ops/s
>>>    [java] CompactIntegerSequenceBench.benchCompressLZ4Fast64k    thrpt   15  365664477.654 ± 10083336.038  ops/s
>>>    [java] CompactIntegerSequenceBench.benchDecompressLZ4Fast8k   thrpt   15  727725713.128 ±  4252436.331  ops/s
>>>    [java] CompactIntegerSequenceBench.benchDecompressLZ4Fast16k  thrpt   15  688369529.911 ± 25620873.933  ops/s
>>>    [java] CompactIntegerSequenceBench.benchDecompressLZ4Fast32k  thrpt   15  703635848.895 ±  5296941.704  ops/s
>>>    [java] CompactIntegerSequenceBench.benchDecompressLZ4Fast64k  thrpt   15  695537044.676 ± 17400763.731  ops/s
>>> 
>>> To summarize, compression is 8.5% slower and decompression is 1% faster.
>>> This measures the impact on compression/decompression itself, not the much
>>> larger win from no longer decompressing data we don't need.
>>> 
>>> I didn't test decompression of Snappy and LZ4 high, but I did test
>>> compression.
>>> 
>>> Snappy:
>>>    [java] CompactIntegerSequenceBench.benchCompressSnappy8k    thrpt    2  186040175.059          ops/s
>>>    [java] CompactIntegerSequenceBench.benchCompressSnappy16k   thrpt    2  196574766.116          ops/s
>>>    [java] CompactIntegerSequenceBench.benchCompressSnappy32k   thrpt    2  198538643.844          ops/s
>>>    [java] CompactIntegerSequenceBench.benchCompressSnappy64k   thrpt    2  194600497.613          ops/s
>>> 
>>> LZ4 high compressor:
>>>    [java] CompactIntegerSequenceBench.bench8k    thrpt    2   32254619.594          ops/s
>>>    [java] CompactIntegerSequenceBench.bench16k   thrpt    2   20822947.578          ops/s
>>>    [java] CompactIntegerSequenceBench.bench32k   thrpt    2   12037342.253          ops/s
>>>    [java] CompactIntegerSequenceBench.bench64k   thrpt    2    6782534.469          ops/s
>>> 
>>> LZ4 high is the one instance where block size mattered a lot. It's a bit
>>> suspicious when you look at the ratio of performance to block size being
>>> close to 1:1, but I couldn't spot a bug in the benchmark.
>>> 
>>> Compression ratios with LZ4 fast for the text of Alice in Wonderland were:
>>> 
>>> Chunk size 8192, ratio 0.709473
>>> Chunk size 16384, ratio 0.667236
>>> Chunk size 32768, ratio 0.634735
>>> Chunk size 65536, ratio 0.607208
>>> 
>>> By way of comparison I also ran deflate with maximum compression:
>>> 
>>> Chunk size 8192, ratio 0.426434
>>> Chunk size 16384, ratio 0.402423
>>> Chunk size 32768, ratio 0.381627
>>> Chunk size 65536, ratio 0.364865
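The ratio trend above is easy to sanity-check without JMH. The sketch below is not Ariel's benchmark: it chunks a buffer and compresses each chunk independently with the JDK's built-in Deflater (standing in for LZ4, which needs a third-party jar), and uses repetitive English text as a stand-in for the Alice corpus. It shows the same direction of effect: smaller chunks compress each chunk independently, so per-chunk overhead and a cold dictionary make the ratio worse.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

public class ChunkRatio {
    // Compress data in independent chunks of chunkSize bytes and return the
    // compressed/uncompressed ratio, mirroring per-chunk SSTable compression.
    static double ratio(byte[] data, int chunkSize) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        byte[] out = new byte[chunkSize + 1024]; // room for incompressible chunks
        long compressed = 0;
        for (int off = 0; off < data.length; off += chunkSize) {
            int len = Math.min(chunkSize, data.length - off);
            deflater.reset();            // each chunk starts with a cold dictionary
            deflater.setInput(data, off, len);
            deflater.finish();
            while (!deflater.finished())
                compressed += deflater.deflate(out);
        }
        deflater.end();
        return (double) compressed / data.length;
    }

    public static void main(String[] args) {
        // Stand-in corpus: 1 MiB of repetitive English text.
        StringBuilder sb = new StringBuilder();
        while (sb.length() < (1 << 20))
            sb.append("The quick brown fox jumps over the lazy dog. ");
        byte[] text = sb.toString().getBytes(StandardCharsets.US_ASCII);
        for (int kb : new int[] {8, 16, 32, 64})
            System.out.printf("Chunk size %d, ratio %f%n", kb * 1024, ratio(text, kb * 1024));
    }
}
```

Absolute numbers will differ from Ariel's (different codec, different corpus); only the shape of the curve carries over.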
>>> 
>>> Ariel
>>> 
>>>> On Thu, Oct 18, 2018, at 5:32 AM, Benedict Elliott Smith wrote:
>>>> FWIW, I’m not -0, just think that long after the freeze date a change
>>>> like this needs a strong mandate from the community.  I think the change
>>>> is a good one.
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>>> On 17 Oct 2018, at 22:09, Ariel Weisberg <ariel@xxxxxxxxxxx> wrote:
>>>>> 
>>>>> Hi,
>>>>> 
>>>>> It's really not appreciably slower compared to the decompression we are
>>>>> going to do, which is going to take several microseconds. Decompression
>>>>> is also going to be faster overall because we will do less unnecessary
>>>>> decompression, and the decompression itself may be faster since smaller
>>>>> chunks may fit better in a higher-level cache. I ran a microbenchmark
>>>>> comparing them.
>>>>>
>>>>> https://issues.apache.org/jira/browse/CASSANDRA-13241?focusedCommentId=16653988&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16653988
>>>>> 
>>>>> Fetching a long from memory:       56 nanoseconds
>>>>> Compact integer sequence   :       80 nanoseconds
>>>>> Summing integer sequence   :      165 nanoseconds
>>>>> 
>>>>> Currently we have one +1 from Kurt to change the representation and
>>>>> possibly a -0 from Benedict. That's not really enough to make an
>>>>> exception to the code freeze. If you want it to happen (or not) you need
>>>>> to speak up; otherwise only the default will change.
>>>>> 
>>>>> Regards,
>>>>> Ariel
>>>>> 
>>>>>> On Wed, Oct 17, 2018, at 6:40 AM, kurt greaves wrote:
>>>>>> I think if we're going to drop it to 16k, we should invest in the
>>>>>> compact sequencing as well. Just lowering it to 16k will have a
>>>>>> potentially painful impact on anyone running low-memory nodes, but if we
>>>>>> can do it without the memory impact I don't think there's any reason to
>>>>>> wait another major version to implement it.
>>>>>> 
>>>>>> Having said that, we should probably benchmark the two representations
>>>>>> Ariel has come up with.
>>>>>> 
>>>>>> On Wed, 17 Oct 2018 at 20:17, Alain RODRIGUEZ <arodrime@xxxxxxxxx> wrote:
>>>>>> 
>>>>>>> +1
>>>>>>> 
>>>>>>> I would guess a lot of C* clusters/tables have this option set to the
>>>>>>> default value, and not many of them actually need to read such big
>>>>>>> chunks of data. I believe this will greatly limit disk overreads for a
>>>>>>> fair amount (a big majority?) of new users. It seems fair enough to
>>>>>>> change this default value, and I also think 4.0 is a nice place to do
>>>>>>> this.
>>>>>>> 
>>>>>>> Thanks for taking care of this Ariel and for making sure there is a
>>>>>>> consensus here as well,
>>>>>>> 
>>>>>>> C*heers,
>>>>>>> -----------------------
>>>>>>> Alain Rodriguez - alain@xxxxxxxxxxxxxxxxx
>>>>>>> France / Spain
>>>>>>> 
>>>>>>> The Last Pickle - Apache Cassandra Consulting
>>>>>>> http://www.thelastpickle.com
>>>>>>> 
>>>>>>> On Sat, 13 Oct 2018 at 08:52, Ariel Weisberg <ariel@xxxxxxxxxxx> wrote:
>>>>>>> 
>>>>>>>> Hi,
>>>>>>>> 
>>>>>>>> This would only impact new tables; existing tables would get their
>>>>>>>> chunk_length_in_kb from the existing schema. It's something we record
>>>>>>>> in a system table.
>>>>>>>>
>>>>>>>> I have an implementation of a compact integer sequence that only
>>>>>>>> requires 37% of the memory required today. So we could do this with
>>>>>>>> only slightly more than doubling the memory used. I'll post that to
>>>>>>>> the JIRA soon.
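Ariel's actual encoding is on the JIRA ticket; purely to illustrate why a monotonically increasing offset sequence can be stored compactly, here is a naive group-of-16 scheme with invented names. It keeps one absolute 8-byte base per 16 chunks plus 4-byte deltas, which is about 56% of a plain long[] footprint, so noticeably less compact than the 37% figure Ariel quotes.

```java
public class CompactOffsets {
    // One absolute 8-byte offset per GROUP chunks, 4-byte deltas in between.
    private static final int GROUP = 16;
    private final long[] bases;
    private final int[] deltas;

    public CompactOffsets(long[] offsets) {
        bases = new long[(offsets.length + GROUP - 1) / GROUP];
        deltas = new int[offsets.length];
        for (int i = 0; i < offsets.length; i++) {
            if (i % GROUP == 0)
                bases[i / GROUP] = offsets[i];
            long delta = offsets[i] - bases[i / GROUP];
            if (delta < 0 || delta > Integer.MAX_VALUE)
                throw new IllegalArgumentException("group spans more than 2GB");
            deltas[i] = (int) delta;
        }
    }

    // Two dependent array reads instead of one plain long[] fetch -- the kind
    // of extra lookup cost Ariel's 56ns-vs-80ns numbers measure.
    public long get(int i) {
        return bases[i / GROUP] + deltas[i];
    }
}
```

A tighter encoding (smaller deltas, variable width) is what gets you toward 37%, at the cost of more work per lookup.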
>>>>>>>> 
>>>>>>>> Ariel
>>>>>>>> 
>>>>>>>>> On Fri, Oct 12, 2018, at 1:56 AM, Jeff Jirsa wrote:
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> I think 16k is a better default, but it should only affect new
>>>>>>>>> tables. Whoever changes it, please make sure you think about the
>>>>>>>>> upgrade path.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> On Oct 12, 2018, at 2:31 AM, Ben Bromhead <ben@xxxxxxxxxxxxxxx> wrote:
>>>>>>>>>> 
>>>>>>>>>> This is something that's bugged me for ages; tbh the performance
>>>>>>>>>> gain for most use cases far outweighs the increase in memory usage,
>>>>>>>>>> and I would even be in favor of changing the default now and
>>>>>>>>>> optimizing the storage cost later (if it's found to be worth it).
>>>>>>>>>>
>>>>>>>>>> For some anecdotal evidence: 4kb is usually what we end up setting
>>>>>>>>>> it to. 16kb feels more reasonable given the memory impact, but what
>>>>>>>>>> would be the point if, practically, most folks set it to 4kb anyway?
>>>>>>>>>>
>>>>>>>>>> Note that chunk_length will largely be dependent on your read sizes,
>>>>>>>>>> but 4k is the floor for most physical devices in terms of block
>>>>>>>>>> size.
>>>>>>>>>>
>>>>>>>>>> +1 for making this change in 4.0 given the small size and the large
>>>>>>>>>> improvement to new users' experience (as long as we are explicit in
>>>>>>>>>> the documentation about memory consumption).
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>> On Thu, Oct 11, 2018 at 7:11 PM Ariel Weisberg <ariel@xxxxxxxxxxx> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> Hi,
>>>>>>>>>>> 
>>>>>>>>>>> This is regarding https://issues.apache.org/jira/browse/CASSANDRA-13241
>>>>>>>>>>>
>>>>>>>>>>> This ticket has languished for a while. IMO it's too late in 4.0 to
>>>>>>>>>>> implement a more memory efficient representation for compressed
>>>>>>>>>>> chunk offsets. However I don't think we should put out another
>>>>>>>>>>> release with the current 64k default as it's pretty unreasonable.
>>>>>>>>>>>
>>>>>>>>>>> I propose that we lower the value to 16kb. 4k might never be the
>>>>>>>>>>> correct default anyway as there is a cost to compression, and 16k
>>>>>>>>>>> will still be a large improvement.
>>>>>>>>>>>
>>>>>>>>>>> Benedict and Jon Haddad are both +1 on making this change for 4.0.
>>>>>>>>>>> In the past there has been some consensus about reducing this
>>>>>>>>>>> value, although maybe with more memory efficiency.
>>>>>>>>>>>
>>>>>>>>>>> The napkin math for what this costs is:
>>>>>>>>>>> "If you have 1TB of uncompressed data, with 64k chunks that's 16M
>>>>>>>>>>> chunks at 8 bytes each (128MB).
>>>>>>>>>>> With 16k chunks, that's 512MB.
>>>>>>>>>>> With 4k chunks, it's 2G.
>>>>>>>>>>> Per terabyte of data (pre-compression)."
>>>>>>>>>>>
>>>>>>>>>>> https://issues.apache.org/jira/browse/CASSANDRA-13241?focusedCommentId=15886621&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15886621
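The napkin math checks out; spelled out as a tiny sketch (my arithmetic, not code from the ticket):

```java
public class OffsetMemory {
    // Today each compressed chunk needs one 8-byte offset held in memory.
    static long offsetBytes(long dataBytes, int chunkBytes) {
        return (dataBytes / chunkBytes) * 8L;
    }

    public static void main(String[] args) {
        long oneTb = 1L << 40; // 1 TiB of uncompressed data
        for (int kb : new int[] {64, 16, 4}) {
            long mb = offsetBytes(oneTb, kb * 1024) >> 20;
            // 64k -> 128 MB, 16k -> 512 MB, 4k -> 2048 MB per TB
            System.out.println(kb + "k chunks -> " + mb + " MB of offsets per TB");
        }
    }
}
```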
>>>>>>>>>>> 
>>>>>>>>>>> By way of comparison, memory mapping the files has a similar cost
>>>>>>>>>>> of 8 bytes per 4k page. Multiple mappings make this more expensive.
>>>>>>>>>>> With a default of 16kb this would be 4x less expensive than memory
>>>>>>>>>>> mapping a file. I only mention this to give a sense of the costs we
>>>>>>>>>>> are already paying; I am not saying they are directly related.
>>>>>>>>>>>
>>>>>>>>>>> I'll wait a week for discussion and if there is consensus, make the
>>>>>>>>>>> change.
>>>>>>>>>>> 
>>>>>>>>>>> Regards,
>>>>>>>>>>> Ariel
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>> ---------------------------------------------------------------------
>>>>>>>>>>> To unsubscribe, e-mail: dev-unsubscribe@xxxxxxxxxxxxxxxxxxxx
>>>>>>>>>>> For additional commands, e-mail: dev-help@xxxxxxxxxxxxxxxxxxxx
>>>>>>>>>>> 
>>>>>>>>>>> --
>>>>>>>>>> Ben Bromhead
>>>>>>>>>> CTO | Instaclustr <https://www.instaclustr.com/>
>>>>>>>>>> +1 650 284 9692
>>>>>>>>>> Reliability at Scale
>>>>>>>>>> Cassandra, Spark, Elasticsearch on AWS, Azure, GCP and Softlayer
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>> 
>>> 
>>> 
> 
> 

