
Stumped By Cassandra delays


We are running Cassandra 2.1.14 in AWS, with c5.4xlarge machines
(initially these were m4.xlarge) for our Cassandra servers and
m4.xlarge for our application servers. On one of the clusters having
problems we have 6 C* nodes and 6 AS nodes (two C*/AS nodes in each
availability zone).

In the deployed application there are two common use cases, both of
which are hitting periodic errors:
(1) Copy one Cassandra table to another table using the application server.
(2) Export from a Cassandra table to file using the application server.

The application server reads from the table by token range, with the
token range queries calculated to ensure that the whole token range
for a query falls on the same node, i.e. the query looks like this:

select * from <table> where token(key) > ? and token(key) <= ?

This was probably initially done on the assumption that the driver
would be able to figure out which nodes contained the data. As we
realized now the driver only supports routing to the right node if the
partition key is defined in the where clause.
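For illustration, here is a minimal sketch (not our actual application code) of splitting the full Murmur3 token space into contiguous subranges for queries of this shape. A real implementation would align the split points with the nodes' ring tokens so each subrange lands on a single node; this version just splits the ring evenly:

```python
# Sketch: split the Murmur3 token space [-2**63, 2**63 - 1] into n
# contiguous (start, end] subranges for "token(key) > ? AND
# token(key) <= ?" queries. Illustrative only - real code would use
# the actual ring tokens as boundaries.
MIN_TOKEN = -2**63
MAX_TOKEN = 2**63 - 1

def split_token_range(n):
    """Return n (start, end] subranges covering the whole ring."""
    step = (MAX_TOKEN - MIN_TOKEN) // n
    ranges = []
    start = MIN_TOKEN
    for i in range(n):
        # Last subrange absorbs any rounding remainder.
        end = MAX_TOKEN if i == n - 1 else start + step
        ranges.append((start, end))
        start = end
    return ranges

for start, end in split_token_range(4):
    print(f"select * from t where token(key) > {start} and token(key) <= {end}")
```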

When we do the read we are doing a lot of queries in parallel to
maximize performance. I believe when the copy is being run there are
currently 5 threads per machine doing the copy for a max of 30
concurrent read requests across the cluster.
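The fan-out pattern is roughly the following (again a sketch, not our code; `run_range_query` is a stand-in for the real driver call):

```python
# Sketch: a fixed pool of worker threads per machine, each running one
# token-subrange query at a time, so at most `threads` range queries
# are in flight from this machine concurrently.
from concurrent.futures import ThreadPoolExecutor

def run_range_query(start, end):
    # Placeholder for something like:
    #   session.execute("select * from t where token(key) > %s and token(key) <= %s",
    #                   (start, end))
    return list(range(3))  # pretend we fetched 3 rows

def copy_table(subranges, threads=5):
    rows_copied = 0
    with ThreadPoolExecutor(max_workers=threads) as pool:
        for rows in pool.map(lambda r: run_range_query(*r), subranges):
            rows_copied += len(rows)
    return rows_copied

print(copy_table([(0, 100), (100, 200), (200, 300)]))  # → 9
```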

Specifically, these tasks have periodically been hitting a few of these errors:

INFO  [ScheduledTasks:1] 2018-07-13 20:03:20,124 - REQUEST_RESPONSE messages were dropped in
last 5000 ms: 1 for internal timeout and 0 for cross node timeout

Which are causing errors in the read by token range queries.

Running "nodetool settraceprobability 1" and re-running the test while
it was failing, we could see that this timeout occurred when the read
query went through a coordinator (i.e. the coordinator sent the
message but didn't get a response to the query from the other node
within the time limit). We were seeing these timeouts periodically
even with the timeouts set to 60 seconds.

As I mentioned at the beginning we had initially been using m4.xlarge
for our Cassandra servers. After discussion with AWS it was suggested
that we could be hitting performance limits (i.e. either network or
disk - I believe more likely network as I didn't see the disk getting
hit very hard) so we upgraded the Cassandra servers and everything was
fine for a while.

But then the problems started to recur recently...pretty consistently
failing on these copy or export jobs running overnight. Looking at
resource usage graphs, the C* servers did not appear heavily loaded at
all (the app servers were being maxed out), and I did not see any
significant garbage collections in the logs that could explain the
delays.

As a last resort I decided to turn up the logging on the server and
client: the DataStax client set to debug, and the server set to the
following levels via nodetool. The goal was to maximize logging while
cutting out the very verbose stuff (e.g.
org.apache.cassandra.transport.Message appears to print out the whole
message at debug in 2.1.14 -> it looks like that was moved to trace in
a later 2.1.x release):

bin/nodetool setlogginglevel org.apache.cassandra.tracing.Tracing INFO
bin/nodetool setlogginglevel org.apache.cassandra.transport.Message INFO
bin/nodetool setlogginglevel org.apache.cassandra.db.ColumnFamilyStore DEBUG
bin/nodetool setlogginglevel org.apache.cassandra.gms.Gossiper DEBUG
bin/nodetool setlogginglevel org.apache.cassandra.db.filter.SliceQueryFilter DEBUG
bin/nodetool setlogginglevel org.apache.cassandra.service.pager.AbstractQueryPager INFO
bin/nodetool setlogginglevel org.apache.cassandra TRACE

Of course, when we did this (the application servers were restarted as
part of turning on the logging), the problematic export-to-file jobs,
which had failed every time for the last week, succeeded and ran much
faster than usual (47 minutes vs 1 1/2 hours). So I decided to look
for the biggest delay (which turned out to be ~9 seconds) and see what
I could find in the log - outside of this window, response times were
up to perhaps 20ms. Here is what I found:

(1) Only one Cassandra node had delays at a time.

(2) On the Cassandra node that did have delays there was no
significant information from the GCInspector (the system stopped
processing client requests between 05:32:33 - 05:32:43). If anything
it confirmed my belief that the system was lightly loaded:

DEBUG [Service Thread] 2018-07-20 05:32:25,559 -
ParNew GC in 8ms.  CMS Old Gen: 2879689192 -> 2879724544; Par Eden
Space: 335544320 -> 0; Par Survivor Space: 8701480 -> 12462936
DEBUG [Service Thread] 2018-07-20 05:32:26,792 -
ParNew GC in 9ms.  CMS Old Gen: 2879724544 -> 2879739504; Par Eden
Space: 335544320 -> 0; Par Survivor Space: 12462936 -> 9463872
DEBUG [Service Thread] 2018-07-20 05:32:29,227 -
ParNew GC in 9ms.  CMS Old Gen: 2879739504 -> 2879755872; Par Eden
Space: 335544320 -> 0; Par Survivor Space: 9463872 -> 10860856
DEBUG [Service Thread] 2018-07-20 05:32:33,190 -
ParNew GC in 8ms.  CMS Old Gen: 2879755872 -> 2879780816; Par Eden
Space: 335544320 -> 0; Par Survivor Space: 10860856 -> 13637832
DEBUG [Service Thread] 2018-07-20 05:32:46,128 -
ParNew GC in 8ms.  CMS Old Gen: 2879780816 -> 2879807064; Par Eden
Space: 335544320 -> 0; Par Survivor Space: 13637832 -> 7992928
DEBUG [Service Thread] 2018-07-20 05:32:51,001 -
ParNew GC in 9ms.  CMS Old Gen: 2879807064 -> 2879854128; Par Eden
Space: 335544320 -> 0; Par Survivor Space: 7992928 -> 10300800

(3) The Cassandra node in question was still responding to MUTATION
requests (there are a small number of inserts going on at the same
time as this flow - probably about 40 every 10 seconds) and to gossip
polls from other servers. When the problematic node polled other
servers it still reported a response being received within at most a
couple of milliseconds:

TRACE [GossipTasks:1] 2018-07-20 05:32:36,043 - /10.123.4.xx sending GOSSIP_DIGEST_SYN to
TRACE [GossipStage:1] 2018-07-20 05:32:36,043 - Received a GossipDigestAckMessage
from /10.123.4.yy

(4) The "bad" Cassandra node was even processing MUTATION requests -
it just wasn't sending any responses back until after 05:32:43. e.g.
this is one of many occurring during the stop period:

TRACE [SharedPool-Worker-1] 2018-07-20 05:32:36,596 - /10.123.4.xx sending MUTATION to
<<< TRACE [SharedPool-Worker-1] 2018-07-20 05:32:36,595 - /10.123.4.zz sending REQUEST_RESPONSE to

From the AS nodes:

2018-07-20 05:32:43,084 [luster1-nio-worker-5] [          ] [
          ] [                    ] (         core.QueryLogger.SLOW)
DEBUG   - [cluster1] [/] Query too slow, took 1731 ms:
[1 statements, 6 bound values] BEGIN BATCH INSERT INTO <table>
(a,b,c,d,e) VALUES (?, ?, ?, ?, ?) using ttl ?; APPLY BATCH;  >>> Must
have been sent ~05:32:41
2018-07-20 05:32:43,049 [luster1-nio-worker-5] [          ] [
          ] [                    ] (         core.QueryLogger.SLOW)
DEBUG   - [cluster1] [/] Query too slow, took 4411 ms:
[1 statements, 6 bound values] BEGIN BATCH INSERT INTO <table> (a, b,
c, d, e) VALUES (?, ?, ?, ?, ?) using ttl ?; APPLY BATCH; >>> Must
have been sent ~05:32:39
2018-07-20 05:32:43,185 [luster1-nio-worker-5] [          ] [
          ] [                    ] (         core.QueryLogger.SLOW)
DEBUG   - [cluster1] [/] Query too slow, took 5950 ms:
[1 statements, 6 bound values] BEGIN BATCH INSERT INTO <table> (a, b,
c, d, e) VALUES (?, ?, ?, ?, ?) using ttl ?; APPLY BATCH; >>> Must
have been sent ~05:32:37
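(The "must have been sent" estimates above are just the log timestamp minus the reported query duration, e.g.:

```python
# Estimate when a slow query was sent: subtract the driver-reported
# duration from the QueryLogger.SLOW log timestamp.
from datetime import datetime, timedelta

def estimated_send_time(logged_at, took_ms):
    logged = datetime.strptime(logged_at, "%Y-%m-%d %H:%M:%S,%f")
    return logged - timedelta(milliseconds=took_ms)

print(estimated_send_time("2018-07-20 05:32:43,185", 5950))
# → 2018-07-20 05:32:37.235000
```

which lines up with the 05:32:33 - 05:32:43 stall on the C* node.)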

I guess the question I have here is: what could I be missing that
could be causing this issue?

(a) Could there still be network limits between our AS and C* servers?
Would it be reasonable to try upgrading our AS servers? I understand
that m4.xlarge has much lower network throughput than c5.4xlarge.

(b) Could a potential bug in the AS screw up Cassandra? When I looked
at netstat in the past (before these servers were upgraded - I only
have very limited access to the specific servers now having the
problem), I could see big send queues building up on Cassandra. If a
bug in the AS were causing send queues to the AS to build up, could
that cause Cassandra threads to block and eventually cause timeouts
(e.g. the AS not reading fast enough causes Cassandra processing
threads to block, which eventually stops Cassandra from having threads
available to process internode requests, thus causing timeouts)? Or is
this just not plausible?
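The check I had been doing amounts to scanning netstat output for connections with large Send-Q values (bytes queued locally that the peer has not yet acknowledged). A sketch of that check, over made-up sample output:

```python
# Flag established TCP connections whose Send-Q exceeds a threshold.
# SAMPLE is fabricated netstat-style output for illustration only.
SAMPLE = """\
Proto Recv-Q Send-Q Local Address          Foreign Address        State
tcp        0      0 10.123.4.10:9042       10.123.4.20:51234      ESTABLISHED
tcp        0 873421 10.123.4.10:9042       10.123.4.21:51822      ESTABLISHED
"""

def large_send_queues(netstat_output, threshold=65536):
    flagged = []
    for line in netstat_output.splitlines():
        fields = line.split()
        # Expect: Proto Recv-Q Send-Q Local-Addr Foreign-Addr State
        if len(fields) >= 5 and fields[0].startswith("tcp") and fields[2].isdigit():
            if int(fields[2]) > threshold:
                flagged.append((fields[4], int(fields[2])))
    return flagged

print(large_send_queues(SAMPLE))  # → [('10.123.4.21:51822', 873421)]
```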

(c) Could this be related to a bug in Cassandra that is already fixed
in a later 2.1.x release? I know there were netty issues along these
lines, which were covered when netty was upgraded to 4.0.44 for
CASSANDRA-13114. I can't believe we are getting close to the native
transport requests limit (CASSANDRA-11363), especially when we are
only exporting to file (as there should be a maximum of around 30
requests outstanding roughly per cluster, not counting the occasional
insert from other tasks).

I am trying very hard to find a reason to justify an upgrade as I know
staying on an older point release is problematic.

(d) Is what we are doing just not a very good idea - are we just
asking for problems by doing it this way? i.e. should we be exporting
a different way, or perhaps reducing the parallelism of these requests?
(I think the thread count was reduced in the past, before the C*
server upgrade, and we still had problems, but I could always try
again.)

Any ideas/suggestions are greatly appreciated.

thanks in advance,
Gareth Collins

To unsubscribe, e-mail: user-unsubscribe@xxxxxxxxxxxxxxxxxxxx
For additional commands, e-mail: user-help@xxxxxxxxxxxxxxxxxxxx