Re: Network problems during repair make it hang on "Wait for validation to complete"

In the previous message, I pasted source code from Cassandra 2.2.8 by mistake.
I re-checked the 2.2.11 source: these lines are the same.

2018-06-21 2:49 GMT+05:00 Dmitry Simonov <dimmoborgir@xxxxxxxxx>:

Using Cassandra 2.2.11, I observe behaviour that is very similar to

Steps to reproduce:
1. Set up a cluster: ccm create five -v 2.2.11 && ccm populate -n 5 --vnodes && ccm start
2. Import some keyspace into it (approx. 50 MB of data)
3. Start repair on one node: ccm node2 nodetool repair KEYSPACE
4. While repair is still running, disconnect node3: sudo iptables -I INPUT -p tcp -d -j DROP
5. This repair hangs.
6. Restore network connectivity
7. Repair is still hanging.
8. Following repairs will also hang.

In tpstats I see tasks that make no progress:

$ for i in {1..5}; do echo node$i; ccm node$i nodetool tpstats | grep "Repair#"; done
Repair#1                          1      2255              1         0                 0
Repair#1                          1      2335             26         0                 0
Repair#3                          1       147           2175         0                 0
Repair#1                          1      2335             17         0                 0
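A "no progress" check like the one above can be automated by comparing two tpstats snapshots taken some time apart. The following is only a minimal sketch: the class and method names are hypothetical, and it assumes the column order is Pool Name, Active, Pending, Completed, Blocked, All time blocked.

```java
public class RepairPoolCheck {
    /** One parsed tpstats row. The column order is an assumption, not taken from the thread. */
    static final class PoolStats {
        final String name;
        final long active;
        final long pending;
        final long completed;

        PoolStats(String name, long active, long pending, long completed) {
            this.name = name;
            this.active = active;
            this.pending = pending;
            this.completed = completed;
        }
    }

    /** Parses a whitespace-separated row such as "Repair#3  1  147  2175  0  0". */
    static PoolStats parse(String line) {
        String[] fields = line.trim().split("\\s+");
        return new PoolStats(fields[0],
                Long.parseLong(fields[1]),
                Long.parseLong(fields[2]),
                Long.parseLong(fields[3]));
    }

    /** A pool looks stuck if it is active in both snapshots and completed nothing in between. */
    static boolean looksStuck(PoolStats before, PoolStats after) {
        return before.name.equals(after.name)
                && before.active > 0
                && after.active > 0
                && after.completed == before.completed;
    }

    public static void main(String[] args) {
        // Two snapshots of the same pool; the second would be taken a few minutes later.
        PoolStats first = parse("Repair#3                          1       147           2175         0                 0");
        PoolStats second = parse("Repair#3                          1       147           2175         0                 0");
        System.out.println(first.name + " stuck: " + looksStuck(first, second));
    }
}
```

A single snapshot cannot distinguish a slow repair from a hung one, which is why the sketch compares the Completed counter across two samples.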

In jconsole I see that Repair threads are blocked here:
Name: Repair#1:1
State: WAITING on$Sync@73c5ab7e
Total blocked: 0  Total waited: 242

Stack trace: 
sun.misc.Unsafe.park(Native Method)

According to the source code, they are waiting for validations to complete:
# ./apache-cassandra-2.2.8-src/src/java/org/apache/cassandra/repair/
    public void run()
    {
        ...
        // Wait for validation to complete
        Futures.getUnchecked(validations);

says that the problem was fixed in 2.2.7, but I use 2.2.11.
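To illustrate why a lost validation response can leave the thread parked indefinitely: Guava's Futures.getUnchecked waits without a timeout, so if the future is never completed, the caller never returns. The sketch below is not Cassandra code and not a proposed patch. It uses the JDK's CompletableFuture as a stand-in for the validations future, and the class and method names are hypothetical. It only contrasts a future that never completes with a bounded wait.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class BoundedWaitSketch {
    /**
     * Waits for the future for at most the given time.
     * Returns true if it completed (normally or with a failure), false if it is still pending.
     */
    static boolean awaitWithTimeout(Future<?> future, long timeout, TimeUnit unit) {
        try {
            future.get(timeout, unit);
            return true;
        } catch (TimeoutException e) {
            return false; // still pending: an unbounded get() would keep blocking here
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
            return false;
        } catch (ExecutionException e) {
            return true; // completed, but with an exception
        }
    }

    public static void main(String[] args) {
        // Stand-in for a validation response that was dropped by the network: never completed.
        CompletableFuture<String> lost = new CompletableFuture<>();
        System.out.println("lost response completed: "
                + awaitWithTimeout(lost, 200, TimeUnit.MILLISECONDS));

        // Stand-in for a validation response that arrived normally.
        CompletableFuture<String> arrived = CompletableFuture.completedFuture("tree");
        System.out.println("normal response completed: "
                + awaitWithTimeout(arrived, 200, TimeUnit.MILLISECONDS));
    }
}
```

With an unbounded wait, the first case corresponds to the WAITING state seen in jconsole: nothing ever completes the future, so the thread stays parked even after connectivity is restored.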

Restarting all Cassandra nodes that have hanging tasks (one by one) makes these tasks disappear from tpstats. After that, repairs work well (until the next network problem).

I also suspect that long GC pauses on one node during repair (like network issues) may lead to the same problem.

Is it a known issue?

Best Regards,
Dmitry Simonov
