nova-api missing heartbeats to rabbitmq
On Tue, 2020-03-24 at 15:04 -0400, Satish Patel wrote:
> Recently I have been seeing lots of error messages in the rabbitmq logs saying
> missing heartbeats from nova-api nodes. I am not seeing any issue at the
> functionality level, as everything is working fine, but I just noticed those
> errors and am trying to find the root cause.
This is expected and a known issue.
A release or two ago we introduced eventlet monkey patching in the nova api
to implement multi-cell scatter-gather requests, whereby we concurrently dispatch requests
to all cells and then wait for the results instead of doing it serially.
A side effect of that change is that the nova api is now monkey patched explicitly, so if you run
it via uwsgi or mod_wsgi the heartbeat thread that was previously a full os thread is now just
a green thread. The wsgi server manages the lifetime of the api process and can set that thread to sleep or
At present there is nothing for the operator to do about this message and you should just ignore it,
bar one caveat: if you are configuring your api, you should not scale it using threads but should instead scale
the api using processes.
Deploying the api as a wsgi application with multiple threads per python process can cause issues, so threads should
always be set to 1 or left unset. We have no real agreement on the long-term fix.
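
To make the "processes, not threads" advice concrete, here is a minimal sketch of a pre-deployment check. It assumes the api is run under uwsgi with an ini file that has a [uwsgi] section using the `processes` and `threads` options; the sample values and the helper name `check_uwsgi_scaling` are hypothetical and not part of nova or uwsgi.

```python
import configparser

# Hypothetical uwsgi ini for the api; the values are illustrative only.
SAMPLE_INI = """
[uwsgi]
processes = 4
threads = 1
"""


def check_uwsgi_scaling(ini_text):
    """Return a list of warnings for a uwsgi ini file given as a string."""
    # uwsgi ini files may repeat keys and contain '%' placeholders, so
    # duplicates are tolerated and interpolation is switched off.
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    parser.read_string(ini_text)
    if not parser.has_section("uwsgi"):
        return ["no [uwsgi] section found"]

    warnings = []
    # An unset "threads" option is treated the same as threads = 1.
    threads = parser["uwsgi"].getint("threads", fallback=1)
    if threads > 1:
        warnings.append(
            "threads = %d: scale the api with processes and keep threads "
            "at 1 or unset" % threads
        )
    return warnings


if __name__ == "__main__":
    for warning in check_uwsgi_scaling(SAMPLE_INI):
        print("WARNING:", warning)
```

A config with threads set above 1 produces a single warning; a config with threads at 1 or unset produces none.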
In some environments, disabling the heartbeat and relying on the os tcp keepalive config is one option.
You can also revert to running the api using the built-in python wsgi server instead of uwsgi, but if you do this
there is a performance penalty, so we don't really advise people to do that.
There have been mail threads on this topic in the past but I do not have them to hand.
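
For readers unfamiliar with what "relying on the os tcp keepalive" means in practice, the sketch below turns keepalive on for a plain socket using the Python standard library. It is only an illustration of the mechanism, not how nova or oslo.messaging configures its connections; the helper name `enable_keepalive` and the timing values are assumptions. System-wide defaults normally come from the operating system (on Linux, the net.ipv4.tcp_keepalive_* sysctls).

```python
import socket


def enable_keepalive(sock, idle=60, interval=10, count=5):
    """Turn on TCP keepalive and tune the probes where the platform allows."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The tuning options below use their Linux names; any option that the
    # local platform does not expose is skipped.
    tuning = (
        ("TCP_KEEPIDLE", idle),       # idle seconds before the first probe
        ("TCP_KEEPINTVL", interval),  # seconds between probes
        ("TCP_KEEPCNT", count),       # failed probes before the drop
    )
    for name, value in tuning:
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
    return sock


if __name__ == "__main__":
    s = enable_keepalive(socket.socket(socket.AF_INET, socket.SOCK_STREAM))
    enabled = s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
    print("keepalive enabled:", enabled)
    s.close()
```

With keepalive the kernel sends the probes, so detection of a dead peer does not depend on a green thread being scheduled inside the api process.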
> 172.28.15.125 nova-api server
> 172.28.15.192 rabbitmq server
> on rabbit.log
> 2020-03-24 12:21:41.389 [error] <0.29772.4418> closing AMQP connection
> <0.29772.4418> (172.28.15.125:42656 -> 172.28.15.192:5671 -
> missed heartbeats from client, timeout: 60s
> on nova-api.log
> 2020-03-24 12:19:06.554 32435 ERROR
> oslo.messaging._drivers.impl_rabbit [-]
> [4b8adff0-ff9f-4863-a939-537d391e5d9e] AMQP server on
> 172.28.15.192:5671 is unreachable: [Errno 104] Connection reset by
> peer. Trying again in 1 seconds.: error: [Errno 104] Connection reset
> by peer
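
To gauge how often this happens and which api nodes are involved, a small script can tally the "missed heartbeats from client" closures per client address. This is a rough sketch that assumes the rabbit.log entries look like the excerpt quoted above; the regular expression and the function name `count_missed_heartbeats` are illustrative and may need adjusting for other RabbitMQ versions.

```python
import re
from collections import Counter

# Captures the client address of a "closing AMQP connection" entry whose
# reason is a missed heartbeat. The tempered dot keeps one match from
# running into the next "closing AMQP connection" entry.
MISSED_RE = re.compile(
    r"closing AMQP connection\s+<[^>]+>\s+"
    r"\((?P<client>[\d.]+):\d+\s+->"
    r"(?:(?!closing AMQP connection).)*?"
    r"missed heartbeats from client",
    re.DOTALL,
)

# The log excerpt from this thread, with the quote markers removed.
SAMPLE = (
    "2020-03-24 12:21:41.389 [error] <0.29772.4418> closing AMQP connection\n"
    "<0.29772.4418> (172.28.15.125:42656 -> 172.28.15.192:5671 -\n"
    "missed heartbeats from client, timeout: 60s\n"
)


def count_missed_heartbeats(log_text):
    """Return a Counter mapping client IP to missed-heartbeat closures."""
    return Counter(m.group("client") for m in MISSED_RE.finditer(log_text))


if __name__ == "__main__":
    for client, count in count_missed_heartbeats(SAMPLE).most_common():
        print(client, count)
```

If only the api nodes show up in the tally, that is consistent with the green-thread explanation rather than a network problem.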