Re: CASSANDRA-13241 lower default chunk_length_in_kb


On Thu, Oct 11, 2018 at 4:31 PM Ben Bromhead <ben@xxxxxxxxxxxxxxx> wrote:

> This is something that's bugged me for ages, tbh the performance gain for
> most use cases far outweighs the increase in memory usage and I would even
> be in favor of changing the default now, optimizing the storage cost later
> (if it's found to be worth it).
>
> For some anecdotal evidence:
> 4kb is usually what we end up setting it to, 16kb feels more reasonable given
> the memory impact, but what would be the point if practically, most folks
> set it to 4kb anyway?
>
> Note that chunk_length will largely be dependent on your read sizes, but 4k
> is the floor for most physical devices in terms of block size.
>

It might be worthwhile to investigate how splitting chunk size into data,
index and compaction sizes would affect performance.


>
> +1 for making this change in 4.0 given the small size and the large
> improvement to new users' experience (as long as we are explicit in the
> documentation about memory consumption).
>
>
> On Thu, Oct 11, 2018 at 7:11 PM Ariel Weisberg <ariel@xxxxxxxxxxx> wrote:
>
> > Hi,
> >
> > This is regarding https://issues.apache.org/jira/browse/CASSANDRA-13241
> >
> > This ticket has languished for a while. IMO it's too late in 4.0 to
> > implement a more memory efficient representation for compressed chunk
> > offsets. However, I don't think we should put out another release with the
> > current 64k default as it's pretty unreasonable.
> >
> > I propose that we lower the value to 16kb. 4k might never be the correct
> > default anyway, as there is a cost to compression, and 16k will still be
> > a large improvement.
> >
> > Benedict and Jon Haddad are both +1 on making this change for 4.0. In the
> > past there has been some consensus about reducing this value, although
> > maybe with more memory efficiency.
> >
> > The napkin math for what this costs is:
> > "If you have 1TB of uncompressed data, with 64k chunks that's 16M chunks
> > at 8 bytes each (128MB).
> > With 16k chunks, that's 512MB.
> > With 4k chunks, it's 2G.
> > Per terabyte of data (pre-compression)."
> >
> >
> > https://issues.apache.org/jira/browse/CASSANDRA-13241?focusedCommentId=15886621&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15886621
> >
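
The napkin math quoted above can be reproduced with a few lines of Python.
This is only a rough sketch: it assumes binary units (1TB = 2^40 bytes,
64k = 65536 bytes) and one 8-byte offset per compressed chunk, as the JIRA
comment states. The names below are made up for illustration and are not
Cassandra APIs.

```python
KIB = 1024
MIB = 1024 ** 2
TIB = 1024 ** 4
OFFSET_BYTES = 8  # assumption from the quoted comment: 8 bytes per chunk offset


def chunk_offset_overhead(data_bytes, chunk_kib, offset_bytes=OFFSET_BYTES):
    """Bytes of chunk-offset metadata needed for data_bytes of uncompressed data."""
    chunk_bytes = chunk_kib * KIB
    chunks = -(-data_bytes // chunk_bytes)  # ceiling division: a partial chunk still needs an offset
    return chunks * offset_bytes


if __name__ == "__main__":
    for chunk_kib in (64, 16, 4):
        overhead_mib = chunk_offset_overhead(TIB, chunk_kib) / MIB
        print(f"{chunk_kib:>2}k chunks: {overhead_mib:.0f} MiB of offsets per TiB")
```

For 1TB of uncompressed data this gives 128 MiB at 64k, 512 MiB at 16k and
2048 MiB at 4k, matching the figures in the quote.
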
> > By way of comparison, memory mapping the files has a similar cost of 8
> > bytes per 4k page. Multiple mappings make this more expensive. With a
> > default of 16kb this would be 4x less expensive than memory mapping a
> > file.
> > I only mention this to give a sense of the costs we are already paying. I
> > am not saying they are directly related.
> >
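
The "4x less expensive" figure follows from comparing bookkeeping bytes per
byte of data covered. A small sketch, assuming the 8 bytes per 4k page figure
from the message and 8 bytes per chunk offset; the function name is
hypothetical:

```python
KIB = 1024


def metadata_bytes_per_data_byte(unit_kib, entry_bytes=8):
    """Bookkeeping cost per byte of data when each unit_kib-sized unit needs one entry."""
    return entry_bytes / (unit_kib * KIB)


mmap_cost = metadata_bytes_per_data_byte(4)    # one 8-byte entry per 4k page (assumed)
chunk_cost = metadata_bytes_per_data_byte(16)  # one 8-byte offset per 16k chunk (assumed)
ratio = mmap_cost / chunk_cost                 # how many times cheaper 16k chunk offsets are

if __name__ == "__main__":
    print(f"16k chunk offsets cost {ratio:.0f}x less than per-page mapping entries")
```

As the message notes, this is only a sense of scale; the two costs are not
directly related.
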
> > I'll wait a week for discussion and, if there is consensus, make the
> > change.
> >
> > Regards,
> > Ariel
> >
> >
> --
> Ben Bromhead
> CTO | Instaclustr <https://www.instaclustr.com/>
> +1 650 284 9692
> Reliability at Scale
> Cassandra, Spark, Elasticsearch on AWS, Azure, GCP and Softlayer
>