You need to review Java GC, system, network, disk, memory, node, and table statistics. A lot can be discerned from visually examining the charts, e.g. whether the node that is failing is the one with the most local reads, the one with the most writes, or whether it is completely unrelated.
Since it's a distributed system, you need to review the data points for all nodes together. Data is the only way to see what's going on. Either connect Prometheus/Grafana, or get Datadog, New Relic, or something else to see the patterns across the cluster.
I assembled that list recently. I would even add that getting system logs into ELK or Splunk could also show some patterns that tailing and grepping would not detect.
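
As a rough sketch of what I mean by looking at all nodes together: the Python snippet below ranks nodes by a metric and flags any node whose value is far above the cluster median. The node names, metric names, and numbers are made up for illustration, and the "more than twice the median" rule is just an assumption on my part. In practice you would pull these values from whatever monitoring system you end up using.

```python
from statistics import median

# Hypothetical per-node samples; in a real setup these would come from
# your monitoring system. All names and values here are invented.
SAMPLES = {
    "node1": {"local_reads": 1200, "local_writes": 800, "gc_pause_ms": 150},
    "node2": {"local_reads": 1100, "local_writes": 900, "gc_pause_ms": 140},
    "node3": {"local_reads": 5200, "local_writes": 850, "gc_pause_ms": 900},
    "node4": {"local_reads": 1000, "local_writes": 2400, "gc_pause_ms": 160},
}


def rank_nodes(samples, metric):
    """Return node names ordered from highest to lowest value of `metric`."""
    return sorted(samples, key=lambda node: samples[node][metric], reverse=True)


def outliers(samples, metric, factor=2.0):
    """Return nodes whose `metric` exceeds `factor` times the cluster median.

    The factor of 2.0 is an arbitrary starting point, not a recommendation.
    """
    mid = median(s[metric] for s in samples.values())
    return [node for node, s in samples.items() if s[metric] > factor * mid]


if __name__ == "__main__":
    for metric in ("local_reads", "local_writes", "gc_pause_ms"):
        print(metric, "ranking:", rank_nodes(SAMPLES, metric))
        print(metric, "outliers:", outliers(SAMPLES, metric))
```

If the node that keeps falling over shows up as an outlier for reads and GC pauses but not for writes, that already narrows down where to look. If it never shows up at all, the cause is probably unrelated to load distribution.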
On Jul 26, 2018, 10:20 AM -0400, R1 J1 <rjsoft100@xxxxxxxxx>, wrote:
Thanks for your prompt replies. No, it is not the same node that keeps bouncing. When you say it is about to tip over: what can we do to stop that?
Also, about that error: you guys are correct. It is a warning and might not be contributing to the node bounce issue, and it can be removed by changing batch_size_warn_threshold_in_kb: 5