Re: 1.2.19: AssertionError when running compactions on a CF with TTLed columns


Hi everyone,

I was finally able to sort out my problem in an "interesting" manner that I think is worth sharing on the list!

Here is what I did: on each node, I stopped Cassandra, deleted all the data files of the column family, started Cassandra again, and ran a repair for this column family.
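
The per-node steps above can be sketched as follows. This is a minimal sketch, not the exact commands that were run: the service name, the data directory, the `<data_dir>/<keyspace>/<column_family>` layout and the keyspace/column family names are all assumptions to adapt to your installation. The function only builds the command strings, it does not execute anything.

```python
import shlex
from pathlib import PurePosixPath


def rebuild_steps(keyspace, column_family,
                  data_dir="/var/lib/cassandra/data",  # assumed data location
                  service="cassandra"):                # assumed service name
    """Return the shell commands for one node, in order (nothing is executed)."""
    cf_dir = PurePosixPath(data_dir) / keyspace / column_family
    return [
        # 1. Stop Cassandra on this node.
        f"sudo service {service} stop",
        # 2. Delete this column family's files only; subdirectories such as
        #    snapshots are left alone because of -maxdepth 1.
        f"find {shlex.quote(str(cf_dir))} -maxdepth 1 -type f -delete",
        # 3. Start Cassandra again.
        f"sudo service {service} start",
        # 4. Repair the column family so the data is streamed back from replicas.
        f"nodetool repair {shlex.quote(keyspace)} {shlex.quote(column_family)}",
    ]


if __name__ == "__main__":
    # "my_keyspace" and "my_cf" are hypothetical names.
    for step in rebuild_steps("my_keyspace", "my_cf"):
        print(step)
```

Printing the steps first makes it easy to review them before running anything by hand, one node at a time, so that the other replicas stay available for the repair to stream from.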

The process took a while since the cluster has 40 nodes, but once it was done, the nodes no longer raised this assertion error!

I believe this was triggered by my tweaking of the "sstable_size_in_mb" parameter. Somehow I ended up with data files of different sizes, and that confused Cassandra.
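
One way to check for such a mix of file sizes is to list the data files of the column family and compare them with the configured target. The sketch below assumes that the data files end in `-Data.db` and sit directly in the column family directory. The `tolerance` factor is a hypothetical threshold chosen for illustration, not a Cassandra setting.

```python
from pathlib import Path


def data_file_sizes_mb(cf_dir):
    """Map each *-Data.db file in cf_dir to its size in MB, largest first."""
    sizes = {p.name: p.stat().st_size / (1024 * 1024)
             for p in Path(cf_dir).glob("*-Data.db")}
    return dict(sorted(sizes.items(), key=lambda kv: kv[1], reverse=True))


def oversized(sizes_mb, target_mb, tolerance=1.5):
    """Names of the files larger than tolerance * target_mb.

    target_mb is the value of sstable_size_in_mb; tolerance is an
    arbitrary margin (an assumption, not a Cassandra parameter).
    """
    return [name for name, mb in sizes_mb.items() if mb > target_mb * tolerance]


if __name__ == "__main__":
    # Hypothetical path and target size; adapt both to your node.
    sizes = data_file_sizes_mb("/var/lib/cassandra/data/my_keyspace/my_cf")
    for name in oversized(sizes, target_mb=10):
        print(f"{name}: {sizes[name]:.1f} MB")
```

Files that stand out here were probably written under a different "sstable_size_in_mb" value, although this check alone does not prove that they cause the assertion.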

So, problem solved now :-)

Cheers,
Reynald


On Fri, Aug 31, 2018 at 7:45 AM Reynald Borer <reynald.borer@xxxxxxxxx> wrote:
Hi everyone,

I'm running a Cassandra 1.2.19 cluster of 40 nodes, and compactions of a specific column family are sporadically raising an AssertionError like this (full stack trace available at https://gist.github.com/rborer/46862d6d693c0163aa8fe0e74caa2d9a):

ERROR [CompactionExecutor:9137] 2018-08-27 11:43:05,197 org.apache.cassandra.service.CassandraDaemon - Exception in thread Thread[CompactionExecutor:9137,1,main]
java.lang.AssertionError: 2
at org.apache.cassandra.db.compaction.LeveledManifest.replace(LeveledManifest.java:267)

The data written in this column family can be seen as wide rows, that is, rows with lots of columns. Each column has a TTL of 7 days though.

Whenever this happens, it seems to block compactions of this column family (I see the pending compactions increasing) until I restart the failing node.
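
The growing backlog can be spotted automatically by sampling the pending count over time. The sketch below assumes the text comes from `nodetool compactionstats`, whose output contains a `pending tasks: N` line. The helper names are hypothetical, and the count is for the whole node rather than for a single column family.

```python
import re


def pending_compactions(compactionstats_output):
    """Extract N from the 'pending tasks: N' line, or None if it is absent."""
    match = re.search(r"^pending tasks:\s*(\d+)",
                      compactionstats_output, re.MULTILINE)
    return int(match.group(1)) if match else None


def is_growing(samples):
    """True if each sample is strictly larger than the one before it."""
    return len(samples) >= 2 and all(b > a for a, b in zip(samples, samples[1:]))


if __name__ == "__main__":
    # Hypothetical captured outputs, e.g. taken a few minutes apart.
    outputs = ["pending tasks: 3\n", "pending tasks: 8\n", "pending tasks: 21\n"]
    samples = [pending_compactions(o) for o in outputs]
    print(samples, is_growing(samples))
```

A count that only ever rises across several samples is a hint that compactions are stuck, as described above, rather than merely busy.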

I have searched JIRA and this mailing list for this issue without much luck. I suspect it may be related to https://issues.apache.org/jira/browse/CASSANDRA-6563, although it's hard for me to confirm.

I know this version is pretty old, but does this issue ring a bell for any of you?

Here are some more details about my cluster:

- it is composed of 40 nodes
- it is pretty old and I'm in the process of upgrading it; it previously ran without issues under versions 1.0.12 and 1.1.12
- it really affects only a single column family (schema can be seen at https://gist.github.com/rborer/46862d6d693c0163aa8fe0e74caa2d9a#file-schema-txt)
- my cluster is set up with RandomPartitioner (inherited from when it was set up on version 0.7) and a replication factor of 3
- it's running weekly repairs (and this assertion happens mostly during repairs)
- what I also noted is that since the cluster was upgraded to 1.2.19 the disk size of this column family keeps increasing (it went from 400G to 1.2T!)

Thanks in advance for your help.

Best regards,
Reynald