
Re: Can't get ActiveMQ Artemis 2.6.2 shared store failover to work

Thanks for the response, Justin.

The information about split brain is very helpful.

You were right about why failover didn't work: /mnt/c/artemis-data is a
directory on my local drive, and I was running Artemis on Windows 10 through
the Windows Subsystem for Linux. It turns out that was what broke failover.
I guess there are some differences between the subsystem's file system and
the one on a real Linux install.

I'm still having some trouble getting failover to work smoothly from the
client's perspective. If I use static URLs, it works perfectly, but I can't
get it to work properly with discovery groups.

As far as I can tell, ServerLocatorImpl is supposed to receive broadcasts
from the servers in the cluster and use the received connector URLs to
connect to a live server. The first time the locator is used, it waits for
a broadcast; on subsequent uses it doesn't wait, but reuses the URLs from
earlier broadcasts. Backup servers don't seem to send broadcasts, so the
locator only ever receives the live server's URL.
When I crash the master server, the locator fails to create a new
connection, because the backup server's URL isn't known and the locator
doesn't wait for a broadcast when asked to create a new session.
This means I can't get a smooth failover to the backup: the locator throws
a connection-loss exception until it receives a broadcast from the backup.

Is there a way to make the failover smoother in this case, or would I need
to use static connectors instead of discovery groups?
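As a stopgap I'm currently wrapping session creation in a retry loop along
these lines. This is just a generic sketch in my own words (the helper name
and the stand-in callable are mine, not Artemis API); the idea is to retry
until the locator has hopefully received the backup's broadcast.

```java
import java.util.concurrent.Callable;

public class RetrySketch {
    // Retry a call a few times, sleeping between attempts, so the
    // discovery group has a chance to receive the backup's broadcast.
    static <T> T withRetry(Callable<T> call, int attempts, long sleepMs) throws Exception {
        Exception last = null;
        for (int i = 0; i < attempts; i++) {
            try {
                return call.call();
            } catch (Exception e) {
                last = e;
                Thread.sleep(sleepMs);
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for connectionFactory.createConnection(): fails twice
        // (no live server known yet), then succeeds.
        final int[] calls = {0};
        String result = withRetry(() -> {
            if (++calls[0] < 3) {
                throw new IllegalStateException("no live server known yet");
            }
            return "connected";
        }, 5, 10);
        System.out.println(result); // prints "connected"
    }
}
```

It works, but it feels like the locator itself should handle this, which is
why I'm asking whether static connectors are the intended approach here.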

The documentation for broadcast groups at
artemis/docs/latest/clusters.html states that a broadcast group can specify
both a primary connector-ref and a backup connector-ref.
Am I misunderstanding this documentation, or should I also list the backup
server's connector URL in the broadcast group in the master's broker.xml
(and vice versa)?
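In other words, is a broadcast-group along these lines what the docs intend?
This is just the shape I'd expect; the group address/port are the defaults
generated by `artemis create`, and the connector names are placeholders, not
taken from my actual broker.xml.

```xml
<broadcast-group name="bg-group1">
   <group-address>231.7.7.7</group-address>
   <group-port>9876</group-port>
   <connector-ref>artemis</connector-ref>
   <connector-ref>artemis-backup</connector-ref>
</broadcast-group>
```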

While looking at this I also stumbled on some odd behavior in a couple of
places in the code.
FileConfigurationParser takes the first connector-ref in the broadcast
group and duplicates it if there is more than one connector-ref in the
configuration. I think the intent was to read all the connector-refs, not
the first one multiple times(?)
DiscoveryGroup stores received connectors by the broadcasting node's id.
This effectively means that a node may broadcast more than one connector,
but only the last one is used. Why broadcast multiple connectors then?
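To illustrate what I mean, here's a toy model of the effect (this is not
the actual DiscoveryGroup code, just the Map-keyed-by-node-id behavior I'm
describing):

```java
import java.util.HashMap;
import java.util.Map;

public class NodeIdMapSketch {
    public static void main(String[] args) {
        // Entries are keyed by node id only, so a node broadcasting two
        // connectors ends up with just the last one stored.
        Map<String, String> entries = new HashMap<>();
        String nodeId = "cb201578";
        for (String connector : new String[] {"tcp://localhost:61616", "tcp://localhost:61617"}) {
            entries.put(nodeId, connector); // second put silently replaces the first
        }
        System.out.println(entries.size());      // prints 1
        System.out.println(entries.get(nodeId)); // prints tcp://localhost:61617
    }
}
```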

I also have another related issue. Once the backup server becomes live, it
starts sending broadcasts, and as far as I can tell the backup's node id is
the same as the master's.
As a result, the DiscoveryGroup doesn't consider the connector it receives
for the backup to have changed from the connector it had for the master,
because it doesn't check whether the DiscoveryEntry has changed, only
whether it already had an entry for that node id.
This causes the DiscoveryGroup to never update the locator, so the locator
can't reconnect until the master comes back up.
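Again as a toy illustration of the check I'm describing (my own sketch, not
the real DiscoveryGroup code): a broadcast only counts as a change when the
node id is new, so a changed connector under a known node id goes unnoticed.

```java
import java.util.HashMap;
import java.util.Map;

public class UpdateCheckSketch {
    // "Changed" is only detected for brand-new node ids, not for a
    // changed entry under an existing node id.
    static boolean receiveBroadcast(Map<String, String> entries, String nodeId, String connector) {
        boolean changed = !entries.containsKey(nodeId);
        entries.put(nodeId, connector);
        return changed; // listeners are only notified when this is true
    }

    public static void main(String[] args) {
        Map<String, String> entries = new HashMap<>();
        // Live server broadcasts first:
        System.out.println(receiveBroadcast(entries, "node-1", "tcp://localhost:61616")); // prints true
        // Backup takes over with the same node id but a different connector:
        System.out.println(receiveBroadcast(entries, "node-1", "tcp://localhost:61617")); // prints false
    }
}
```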

I've uploaded the broker.xml for the master and slave to Pastebin; they
were generated with the following commands:
./artemis create --clustered --cluster-user test --cluster-password test
--user test --password test --shared-store --data C:\artemis-data --host
localhost --http-port 8161 --failover-on-shutdown --default-port 61616
./artemis create --clustered --cluster-user test --cluster-password test
--user test --password test --shared-store --data C:\artemis-data --host
localhost --http-port 8162 --failover-on-shutdown --default-port 61617
--slave C:\artemis-slave

My client is a Camel program using camel-jms, with roughly the following
connection factory setup (restoring the lines that got truncated when I
pasted them; the multicast address/port shown are just the defaults from
artemis create):

        DiscoveryGroupConfiguration discoveryConfig = new DiscoveryGroupConfiguration();
        UDPBroadcastEndpointFactory broadcastEndpointFactory = new UDPBroadcastEndpointFactory()
                .setGroupAddress("231.7.7.7")   // artemis create default
                .setGroupPort(9876);            // artemis create default
        discoveryConfig.setBroadcastEndpointFactory(broadcastEndpointFactory);
        ActiveMQConnectionFactory connectionFactory = ActiveMQJMSClient
                .createConnectionFactoryWithHA(discoveryConfig, JMSFactoryType.CF);

I'm running Windows 10, and both master and slave are running on localhost.
Let me know if I should post an example client project as well.

2018-08-01 21:06 GMT+02:00 Justin Bertram <jbertram@xxxxxxxxxx>:

> > Is it possible to have more than one backup of the data?
> It is not possible to have more than one backup of the data managed by the
> broker.  You are, of course, free to use technology to replicate the data
> underneath the broker (e.g. replicated filesystem).
> > My understanding is that to avoid split brain (
> > I'd need at least 6 servers (3 live, 3 backup). Is this correct, and does
> > it only apply to replication HA, or also to shared store HA?
> To mitigate against the chance of split brain it is recommended to have an
> odd number of live brokers in the cluster so a majority is easy to
> establish.  The smallest odd number larger than 1 is 3.  Hence the
> recommendation for 3 live/backup pairs.  Keep in mind that these can be
> colocated to avoid wasting resources.
> Split brain is only a problem in the case of replication.  In the
> shared-store use-case the shared storage itself mitigates against
> split-brain.
> > Am I misunderstanding how the failover should work, or is there something
> > wrong with the configuration?
> I can think of two options here off the top of my head:
>   1) The shared storage doesn't properly implement file locks.  Can you
> elaborate on what "/mnt/c/artemis-data" is?  Is it NFS or some other kind
> of NAS?
>   2) There is a bug in the way the "artemis create" generates the
> configuration.  Could you paste (or pastebin) the configuration from both
> the live and the backup?
> Justin
> On Wed, Aug 1, 2018 at 7:57 AM, Stig Rohde Døssing <stigdoessing@xxxxxxxxx>
> wrote:
> > Hi,
> >
> > I'm new to ActiveMQ, so I have a couple of conceptual questions as well
> as
> > a technical one.
> >
> > I'd like to set up Artemis in a high availability configuration, so the
> > queue system as a whole keeps working, even if I disable single machines
> > in the cluster. I'm familiar with Kafka, which provides this ability via
> > a Zookeeper quorum, replication and leader elections.
> >
> > Going by the documentation at
> >, I get the
> > impression that each live server can only have a single backup. Is it
> > possible to have more than one backup of the data?
> >
> > My understanding is that to avoid split brain (
> >,
> > I'd need at least 6 servers (3 live, 3 backup). Is this correct, and does
> > it only apply to replication HA, or also to shared store HA?
> >
> > I wanted to try out shared store failover behavior, so I set up two
> > brokers
> > locally using the following commands:
> >
> > ./artemis create --clustered --shared-store --data /mnt/c/artemis-data
> > --host localhost --http-port 8161 --failover-on-shutdown --default-port
> > 61616 /mnt/c/artemis-master
> >
> > ./artemis create --clustered --shared-store --data /mnt/c/artemis-data
> > --host localhost --http-port 8162 --failover-on-shutdown --default-port
> > 61617 --slave /mnt/c/artemis-slave
> >
> > I can't get the backup to take over when doing this. The log is spammed
> > with the following message in both brokers:
> >
> > 2018-08-01 11:48:33,897 WARN  [org.apache.activemq.artemis.core.client]
> > AMQ212034: There are more than one servers on the network broadcasting the
> > same node id. You will see this message exactly once (per node) if a node
> > is restarted, in which case it can be safely ignored. But if it is logged
> > continuously it means you really do have more than one node on the same
> > network active concurrently with the same node id. This could occur if you
> > have a backup node active at the same time as its live node.
> > nodeID=cb201578-9580-11e8-b925-f01faf531f94
> >
> > Just for the sake of completeness, I tried replacing the
> > broadcast/discovery group in broker.xml with static-connector
> > configuration, and this gets rid of this warning but the backup still
> > won't take over for the master when I kill the master process. The backup
> > broker clearly logs that it has lost connection to another server in the
> > cluster, but it doesn't seem to take the live role.
> >
> > Am I misunderstanding how the failover should work, or is there something
> > wrong with the configuration?
> >