Re: Help with bad errors on 4.6.1


Findings of today:
A - the system fails even with BK 4.6.0
B - we have moved all the clients and the bookies to different machines
(keeping the same ZK cluster), same problem
C - I have copies of the application which are running on other similar
machines (on the same Blade/VMWare system)
D - I have tried to disable Netty epoll on the client side (Sijie's
suggestion): no effect
E - with ensemblesize = 1 the problem on readers does not occur, but the
writer does not seem able to recover from a restart of the only bookie
(it seems stuck, with PendingAddOp logging "Failed to write entry ....
Bookie Operation Timed Out")
F - the ZK cluster is working perfectly, as it is serving a lot of other
services of the application (Kafka, Majordodo, BlazingCache, HBase....)
without errors
G - all of the other distributed components are running without issues
(Kafka, HDFS .... see the list above about ZK), and so are the other
database connections (the application connects to several external machines)
H - bookkeeper bookiesanity runs OK on every bookie
I - my colleagues checked networking, VMware and the OS; we suspected
problems with the loopback interfaces, but the problem still occurs after
moving each part to a dedicated machine
L - I have tested with 4.6.2-SNAPSHOT...same as above
M - the problem starts when a bookie restarts and then joins the cluster
again (not when you kill it)

given all of these facts:
1) it may be a network/OS problem (which I doubt, given points F and G)
2) it may be a bug on BK
3) it is not a regression on 4.6.1 but 4.6.2 has no fix
4) I will instrument the BK code in order to get better debug output for the error
5) I will create a reproducer without the full application (which is huge)

I have memory (hprof) dumps of a failing client and a failing bookie, if
someone has time to look at them. Honestly, I have already spent some time
looking for a leak or a bad recycler, but without success (I am not sure
this is the right way to approach this problem).

I have no proof, but maybe there is a problem with pending reads: while the
bookie is down the read remains "pending", then when the channel becomes
active again (the bookie rejoins the cluster) that "old" pending read (which
is no longer needed) reaches the bookie and crashes everything.
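
To make the hypothesis above concrete, here is a minimal toy model of it.
It is NOT BookKeeper code: PendingReadModel, PendingRead and the "cancelled"
flag are hypothetical names made up for illustration. It only shows the
behaviour I would expect, where reads queued while the bookie is down are
dropped on reconnect if nobody needs them anymore.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical toy model, not the real PerChannelBookieClient:
// a channel that queues reads while disconnected and, on reconnect,
// forwards only the reads that are still wanted.
public class PendingReadModel {

    static final class PendingRead {
        final long ledgerId;
        final long entryId;
        boolean cancelled; // assumption: set once the caller no longer needs it

        PendingRead(long ledgerId, long entryId) {
            this.ledgerId = ledgerId;
            this.entryId = entryId;
        }
    }

    private final Deque<PendingRead> pending = new ArrayDeque<>();
    private boolean connected = true;
    final List<String> sentToBookie = new ArrayList<>();

    void disconnect() {
        connected = false;
    }

    PendingRead read(long ledgerId, long entryId) {
        PendingRead r = new PendingRead(ledgerId, entryId);
        if (connected) {
            send(r);
        } else {
            pending.add(r); // the read stays "pending" while the bookie is down
        }
        return r;
    }

    // Flushes the queue on reconnect and returns how many stale reads were dropped.
    int reconnect() {
        connected = true;
        int dropped = 0;
        while (!pending.isEmpty()) {
            PendingRead r = pending.poll();
            if (r.cancelled) {
                dropped++; // stale read: never reaches the bookie
            } else {
                send(r);
            }
        }
        return dropped;
    }

    private void send(PendingRead r) {
        sentToBookie.add(r.ledgerId + ":" + r.entryId);
    }

    public static void main(String[] args) {
        PendingReadModel ch = new PendingReadModel();
        ch.disconnect();
        PendingRead stale = ch.read(1L, 10L);
        ch.read(1L, 11L);
        stale.cancelled = true;
        int dropped = ch.reconnect();
        System.out.println("dropped=" + dropped + " sent=" + ch.sentToBookie);
    }
}
```

If the real client instead forwards the stale read after the channel comes
back, that would match what I am describing; the instrumentation in point 4
should tell whether this is actually what happens.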

It is interesting that the "other" bookies seem to break, not the one
which rejoins the cluster (at least this is how it looks to me).

@Ivan
I wonder if some Jepsen tests with bookie restarts could find this kind of
issue, given that it is not a network/OS problem

Regards

Enrico





2018-03-12 20:51 GMT+01:00 Enrico Olivelli <eolivelli@xxxxxxxxx>:

>
>
> On Mon, 12 Mar 2018, 20:40 Ivan Kelly <ivank@xxxxxxxxxx> wrote:
>
>> > It is interesting that the problems is on 'readers' and it seems that
>> the
>> > PCBC seems corrupted and even writes (if the broker is promoted to
>> > 'leader') are able to go on after the reads broke the client.
>> Are writes coming from the same clients? Or clients in the same process?
>>
>
> Same o.a.b.c.BookKeeper object
>
>>
>> -Ivan
>>
> --
>
>
> -- Enrico Olivelli
>