git.net


Re: saving distinct data in cassandra result in many tombstones


1. How do I use a sharded partition key so that the partitions end up on different nodes?
You could, for example, create a table with a bucket column added to the partition key:
CREATE TABLE distinct (
    hourNumber int,
    bucket int, // could be a 5-minute bucket, for example
    key text,
    distinctValue bigint,
    PRIMARY KEY ((hourNumber, bucket))
);
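
A minimal sketch of how a writer and a reader might derive the bucket, assuming 5-minute buckets (12 per hour) and an hourNumber counted in hours since the Unix epoch. The function names are hypothetical and nothing here comes from a Cassandra driver. Each (hourNumber, bucket) pair is a separate partition with its own token, so the buckets of one hour are likely to land on different nodes. The reader then has to query every bucket of the hour and deduplicate keys that appear in more than one bucket.

```python
from datetime import datetime, timezone
from typing import List, Tuple

BUCKET_MINUTES = 5                       # assumed bucket width
BUCKETS_PER_HOUR = 60 // BUCKET_MINUTES  # 12 partitions per hour


def partition_for(ts: datetime) -> Tuple[int, int]:
    """Return the (hourNumber, bucket) partition key for a timestamp."""
    epoch_seconds = int(ts.timestamp())
    hour_number = epoch_seconds // 3600
    bucket = (epoch_seconds % 3600) // (BUCKET_MINUTES * 60)
    return hour_number, bucket


def partitions_for_hour(hour_number: int) -> List[Tuple[int, int]]:
    """List every partition that must be read to cover one hour."""
    return [(hour_number, b) for b in range(BUCKETS_PER_HOUR)]


# A write at 07:23 UTC falls into the fifth bucket (index 4) of its hour.
ts = datetime(2018, 6, 19, 7, 23, tzinfo=timezone.utc)
hour_number, bucket = partition_for(ts)
print(bucket)
print(len(partitions_for_hour(hour_number)))
```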

2. If I set gc_grace_seconds to 0, would it replace the row in the memtable (not saving repeated rows in SSTables), or would that happen at the first compaction?
Overlapping rows in the memtable are merged regardless of gc_grace_seconds. Setting gc_grace_seconds to 0 lets compaction evict tombstones immediately, but it also disables hint delivery. You should keep gc_grace_seconds greater than max_hint_window_in_ms (once both are converted to the same unit).
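
Because the two settings use different units, the comparison is easy to get wrong. The following is only an illustrative check, assuming the stock cassandra.yaml value of 10800000 ms (3 hours) for max_hint_window_in_ms; the helper name is made up for this example.

```python
# Stock cassandra.yaml value for max_hint_window_in_ms (3 hours); an assumption
# here, so check the value configured on your own cluster.
DEFAULT_HINT_WINDOW_MS = 3 * 60 * 60 * 1000


def outlives_hint_window(gc_grace_seconds: int, max_hint_window_in_ms: int) -> bool:
    """True when gc_grace_seconds is longer than the hint window."""
    return gc_grace_seconds * 1000 > max_hint_window_in_ms


print(outlives_hint_window(0, DEFAULT_HINT_WINDOW_MS))       # False
print(outlives_hint_window(3600, DEFAULT_HINT_WINDOW_MS))    # False: 1 hour < 3 hours
print(outlives_hint_window(864000, DEFAULT_HINT_WINDOW_MS))  # True: the 10-day default
```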



On Tue, Jun 19, 2018 at 7:23 AM, onmstester onmstester <onmstester@xxxxxxxx> wrote:
Two other questions:
1. How do I use a sharded partition key so that the partitions end up on different nodes?
2. If I set gc_grace_seconds to 0, would it replace the row in the memtable (not saving repeated rows in SSTables), or would that happen at the first compaction?

Sent using Zoho Mail



---- On Tue, 19 Jun 2018 08:16:28 +0430 onmstester onmstester <onmstester@xxxxxxxx> wrote ----

Can I set gc_grace_seconds to 0 in this case? Reappearing deleted data has no impact on my business logic, because I'm always either creating a new row or replacing exactly the same row.




---- On Wed, 13 Jun 2018 03:41:51 +0430 Elliott Sims <elliott@xxxxxxxxxxxxx> wrote ----



If this is data that expires after a certain amount of time, you probably want to look into using TWCS and TTLs to minimize the number of tombstones.
Decreasing gc_grace_seconds then compacting will reduce the number of tombstones, but at the cost of potentially resurrecting deleted data if the table hasn't been repaired during the grace interval.  You can also just increase the tombstone thresholds, but the queries will be pretty expensive/wasteful.
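
As a rough illustration of the TWCS-plus-TTL suggestion, the sketch below only builds the CQL text; it does not talk to a cluster. The table name, the one-hour window and the one-day TTL are placeholder assumptions, while `TimeWindowCompactionStrategy`, `compaction_window_unit`, `compaction_window_size` and `default_time_to_live` are standard table options.

```python
def twcs_ttl_statement(table: str, window_hours: int, ttl_seconds: int) -> str:
    """Build an ALTER TABLE statement that switches to TWCS and sets a default TTL."""
    return (
        f"ALTER TABLE {table} WITH compaction = {{"
        f"'class': 'TimeWindowCompactionStrategy', "
        f"'compaction_window_unit': 'HOURS', "
        f"'compaction_window_size': '{window_hours}'}} "
        f"AND default_time_to_live = {ttl_seconds};"
    )


# Hypothetical keyspace and table name; 86400 seconds is one day.
print(twcs_ttl_statement("my_keyspace.distinct", 1, 86400))
```

With a default TTL, rows expire on their own, and TWCS can drop an SSTable as a whole once everything in it has expired, which is what keeps the tombstone count down.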

On Tue, Jun 12, 2018 at 2:02 AM, onmstester onmstester <onmstester@xxxxxxxx> wrote:


Hi,

I need to save a distinct value for each key in each hour. The problem with saving everything and computing the distinct values in memory is that there
is too much repeated data.
Table schema:
CREATE TABLE distinct (
    hourNumber int,
    key text,
    distinctValue bigint,
    PRIMARY KEY (hourNumber)
);

I want to retrieve the distinct count of all keys in a specific hour, and with this data model that can be done by reading a single partition.
The problem: I can't read from this table. system.log indicates that more than 100K tombstones were read and there is no live data in it. gc_grace_seconds is
at the default (10 days), so I thought of decreasing it to 1 hour and running a compaction. But is this the right approach at all? I mean the whole idea of replacing
some millions of rows, each about 10 times, in a partition again and again, which creates a lot of tombstones just to achieve distinct behavior.

Thanks in advance







