git.net

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Help with bad errors on 4.6.1


2018-03-09 8:59 GMT+01:00 Sijie Guo <guosijie@xxxxxxxxx>:

> Sent out a PR for the issues that I observed:
>
> https://github.com/apache/bookkeeper/pull/1240
>


Other findings:
- my problem is not related to jdk9, it happens with jdk8 too
- the "tailing reader" is able to make progress and follow the WAL, so not
all the reads fail
- the "writer" is able to make progress and write to the WAL, so not all
the write fail

I have run BK 4.6.1 tests with Netty 4.1.21Final but there is no issue
(quite the OS as the machines with the error, I am on Fedora, machines are
on CentOS)

Enrico


>
>
> On Thu, Mar 8, 2018 at 10:47 PM, Sijie Guo <guosijie@xxxxxxxxx> wrote:
>
>> So the problem here is:
>>
>> - a corrupted request failed the V3 request decoder, so bookie switched
>> to use v2 request decoder. Once the switch happen, the bookie will always
>> use v2 request decoder decoding v3 request. then all your v3 requests will
>> be failing with unknown op and trigger the bytebuf double releasing issue.
>>
>> - a corrupted response failed the V3 response decoder, so client switched
>> to use v3 response decoder. Once the switch happen, the client will always
>> use v2 request decoder decoding v3 response. so all the responses will be
>> failing with " Received unknown response : op code"
>>
>> Although I don't know how the first request/response happened (it can be
>> any issue, even network corruption), the problem is once this happen,
>> either client/bookie will be forcing to use v2 request/response decoder and
>> never change. so the problem will remain until you restarted. And this is
>> the behavior that Enrico is seeing.
>>
>> There are a couple of issues to address here:
>>
>> 1) we need to add a flag to disable falling back to use v2
>> request/response coder and make sure it always use v3 protocol. In this
>> way, we will guarantee the problem not persist in the channel level.
>> 2) we need to throw exception on unknown op code at request decode :
>> https://github.com/apache/bookkeeper/blob/master/bookkeepe
>> r-server/src/main/java/org/apache/bookkeeper/proto/
>> BookieProtoEncoding.java#L195 . As what we did at response decoder :
>> https://github.com/apache/bookkeeper/blob/master/bookkeepe
>> r-server/src/main/java/org/apache/bookkeeper/proto/
>> BookieProtoEncoding.java#L304 in https://github.com/apache/b
>> ookkeeper/issues/198
>>
>>
>> Details are listed as below:
>>
>> --
>>
>> Details:
>>
>> - The client side stacktrace clearly showed that it is using v2 decoder
>> on decoding responses. That means client failed to parse v3 response and it
>> falls back to use v2 decoder on decoding responses. Because it is a
>> "corrupted" v3 response, so v2 decoder can't
>> find a right op code.  Then it throws illegal state exception.
>>
>>
>> *Caused by: java.lang.IllegalStateException: Received unknown response :
>> op code = 6        at
>> org.apache.bookkeeper.proto.BookieProtoEncoding$ResponseEnDeCoderPreV3.decode(BookieProtoEncoding.java:294)*
>>
>> - For the stacktrace at bookie side:
>>
>> 1. It is clear that BookieRequestHandler#L77 is called. That means the
>> message is neither v2 request nor v3 request. It is a ByteBuf.
>>
>> https://github.com/apache/bookkeeper/blob/master/bookkeeper-
>> server/src/main/java/org/apache/bookkeeper/proto/
>> BookieRequestHandler.java#L77
>>
>> 2. V3 request decoder is using protobuf to decode bytes array. If it is
>> not a valid v3 request, it will always throw exceptions. so the code
>> mentioned above will never be triggered
>>
>> https://github.com/apache/bookkeeper/blob/master/bookkeeper-
>> server/src/main/java/org/apache/bookkeeper/proto/
>> BookieProtoEncoding.java#L344
>>
>> 3. The only case that BookieRequestHandler#L77 can happen is v2 request
>> decoder doesn't parse the buffer. This leads to the bug in
>>
>> https://github.com/apache/bookkeeper/blob/master/bookkeeper-
>> server/src/main/java/org/apache/bookkeeper/proto/
>> BookieProtoEncoding.java#L194
>>
>>
>> - How this can happen?
>>
>> If the client is using v3 protocol, but both client and bookie side
>> stacktraces show it is using v2 protocol, that means there was a corruption
>> causing client and bookie switching from v3 protocol to v2 protocol.
>>
>> https://github.com/apache/bookkeeper/blob/master/bookkeeper-
>> server/src/main/java/org/apache/bookkeeper/proto/
>> BookieProtoEncoding.java#L502
>> https://github.com/apache/bookkeeper/blob/master/bookkeeper-
>> server/src/main/java/org/apache/bookkeeper/proto/
>> BookieProtoEncoding.java#L432
>>
>>
>>
>>
>>
>> On Thu, Mar 8, 2018 at 12:18 AM, Enrico Olivelli <eolivelli@xxxxxxxxx>
>> wrote:
>>
>>> Hi all,
>>> I am seeing this bad errors in some test environments with 4.6.1
>>> The errors appear during rolling restart of the application, the test
>>> env is made of 6 machines:
>>> - 3 bookies
>>> - 3 client machines (with several BK clients, of different sub-systems)
>>> - running with 4.6.1 both client and servers
>>>
>>> *I do not have* reports of this errors from production, already running
>>> 4.6.1 for the last month
>>>
>>> But the problem is quite scary
>>>
>>> This is a sample of relevant errors on clients (in this case Majordodo
>>> brokers, with log level = INFO)
>>>
>>> logs/org.apache.bookkeeper.client.PendingAddOp:
>>> 18-03-07-09-51-43       Write of ledger entry to quorum failed: L366634
>>> E2557
>>> 18-03-07-09-51-43       Write of ledger entry to quorum failed: L366634
>>> E2558
>>> 18-03-07-09-51-43       Write of ledger entry to quorum failed: L366634
>>> E2559
>>> 18-03-07-15-59-55       Failed to write entry (366680, 1865): Bookie
>>> operation timeout
>>> 18-03-07-15-59-55       Failed to write entry (366680, 1865): Bookie
>>> operation timeout
>>> 18-03-07-16-00-00       Failed to write entry (366680, 1865): Bookie
>>> operation timeout
>>> 18-03-07-16-00-00       Failed to write entry (366680, 1865): Bookie
>>> operation timeout
>>> 18-03-07-16-00-05       Failed to write entry (366680, 1865): Bookie
>>> operation timeout
>>>
>>>
>>> org.apache.bookkeeper.proto.PerChannelBookieClient:
>>>
>>> 18-03-07-10-06-44       Unexpected exception caught by bookie client
>>> channel handler
>>> 18-03-07-10-06-44       io.netty.handler.codec.DecoderException:
>>> java.lang.IllegalStateException: Received unknown response : op code = 6
>>> io.netty.handler.codec.DecoderException: java.lang.IllegalStateException:
>>> Received unknown response : op code = 6
>>>         at io.netty.handler.codec.MessageToMessageDecoder.channelRead(M
>>> essageToMessageDecoder.java:98)
>>>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannel
>>> Read(AbstractChannelHandlerContext.java:362)
>>>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannel
>>> Read(AbstractChannelHandlerContext.java:348)
>>>         at io.netty.channel.AbstractChannelHandlerContext.fireChannelRe
>>> ad(AbstractChannelHandlerContext.java:340)
>>>         at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(
>>> ByteToMessageDecoder.java:310)
>>>         at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(
>>> ByteToMessageDecoder.java:297)
>>>         at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteT
>>> oMessageDecoder.java:413)
>>>         at io.netty.handler.codec.ByteToMessageDecoder.channelRead(Byte
>>> ToMessageDecoder.java:265)
>>>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannel
>>> Read(AbstractChannelHandlerContext.java:362)
>>>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannel
>>> Read(AbstractChannelHandlerContext.java:348)
>>>         at io.netty.channel.AbstractChannelHandlerContext.fireChannelRe
>>> ad(AbstractChannelHandlerContext.java:340)
>>>         at io.netty.channel.DefaultChannelPipeline$HeadContext.channelR
>>> ead(DefaultChannelPipeline.java:1414)
>>>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannel
>>> Read(AbstractChannelHandlerContext.java:362)
>>>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannel
>>> Read(AbstractChannelHandlerContext.java:348)
>>>         at io.netty.channel.DefaultChannelPipeline.fireChannelRead(Defa
>>> ultChannelPipeline.java:945)
>>>         at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStrea
>>> mUnsafe.epollInReady(AbstractEpollStreamChannel.java:800)
>>>         at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEven
>>> tLoop.java:404)
>>>         at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.jav
>>> a:304)
>>>         at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(Sin
>>> gleThreadEventExecutor.java:886)
>>>         at java.base/java.lang.Thread.run(Thread.java:844)
>>>
>>
>>
>>
>>
>>
>>> Caused by: java.lang.IllegalStateException: Received unknown response :
>>> op code = 6
>>>         at org.apache.bookkeeper.proto.BookieProtoEncoding$ResponseEnDe
>>> CoderPreV3.decode(BookieProtoEncoding.java:294)
>>>         at org.apache.bookkeeper.proto.BookieProtoEncoding$ResponseDeco
>>> der.decode(BookieProtoEncoding.java:478)
>>>         at io.netty.handler.codec.MessageToMessageDecoder.channelRead(M
>>> essageToMessageDecoder.java:88)
>>>
>>>
>>>         ... 19 more
>>>
>>
>>
>>
>>
>>
>>
>>
>>>
>>>
>>> Errors on Bookie side, with log level = INFO)
>>>
>>>
>>> io.netty.channel.DefaultChannelPipeline:
>>> 8-03-07-09-51-38       An exceptionCaught() event was fired, and it
>>> reached at the tail of the pipeline. It usually means the last handler in
>>> the pipeline did not handle the exception.
>>> 18-03-07-09-51-38       io.netty.util.IllegalReferenceCountException:
>>> refCnt: 0, increment: 1
>>> io.netty.util.IllegalReferenceCountException: refCnt: 0, increment: 1
>>>         at io.netty.buffer.AbstractReferenceCountedByteBuf.release0(Abs
>>> tractReferenceCountedByteBuf.java:100)
>>>         at io.netty.buffer.AbstractReferenceCountedByteBuf.release(Abst
>>> ractReferenceCountedByteBuf.java:84)
>>>         at io.netty.util.ReferenceCountUtil.release(ReferenceCountUtil.
>>> java:88)
>>>         at io.netty.channel.DefaultChannelPipeline.onUnhandledInboundMe
>>> ssage(DefaultChannelPipeline.java:1202)
>>>         at io.netty.channel.DefaultChannelPipeline$TailContext.channelR
>>> ead(DefaultChannelPipeline.java:1299)
>>>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannel
>>> Read(AbstractChannelHandlerContext.java:362)
>>>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannel
>>> Read(AbstractChannelHandlerContext.java:348)
>>>         at io.netty.channel.AbstractChannelHandlerContext.fireChannelRe
>>> ad(AbstractChannelHandlerContext.java:340)
>>>         at io.netty.channel.ChannelInboundHandlerAdapter.channelRead(Ch
>>> annelInboundHandlerAdapter.java:86)
>>>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannel
>>> Read(AbstractChannelHandlerContext.java:362)
>>>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannel
>>> Read(AbstractChannelHandlerContext.java:348)
>>>         at io.netty.channel.AbstractChannelHandlerContext.fireChannelRe
>>> ad(AbstractChannelHandlerContext.java:340)
>>>         at org.apache.bookkeeper.proto.BookieRequestHandler.channelRead
>>> (BookieRequestHandler.java:78)
>>>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannel
>>> Read(AbstractChannelHandlerContext.java:362)
>>>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannel
>>> Read(AbstractChannelHandlerContext.java:348)
>>>         at io.netty.channel.AbstractChannelHandlerContext.fireChannelRe
>>> ad(AbstractChannelHandlerContext.java:340)
>>>         at io.netty.channel.ChannelInboundHandlerAdapter.channelRead(Ch
>>> annelInboundHandlerAdapter.java:86)
>>>         at org.apache.bookkeeper.proto.AuthHandler$ServerSideHandler.ch
>>> annelRead(AuthHandler.java:88)
>>>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannel
>>> Read(AbstractChannelHandlerContext.java:362)
>>>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannel
>>> Read(AbstractChannelHandlerContext.java:348)
>>>         at io.netty.channel.AbstractChannelHandlerContext.fireChannelRe
>>> ad(AbstractChannelHandlerContext.java:340)
>>>         at io.netty.handler.codec.MessageToMessageDecoder.channelRead(M
>>> essageToMessageDecoder.java:102)
>>>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannel
>>> Read(AbstractChannelHandlerContext.java:362)
>>>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannel
>>> Read(AbstractChannelHandlerContext.java:348)
>>>         at io.netty.channel.AbstractChannelHandlerContext.fireChannelRe
>>> ad(AbstractChannelHandlerContext.java:340)
>>>         at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(
>>> ByteToMessageDecoder.java:310)
>>>         at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(
>>> ByteToMessageDecoder.java:297)
>>>         at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteT
>>> oMessageDecoder.java:413)
>>>         at io.netty.handler.codec.ByteToMessageDecoder.channelRead(Byte
>>> ToMessageDecoder.java:265)
>>>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannel
>>> Read(AbstractChannelHandlerContext.java:362)
>>>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannel
>>> Read(AbstractChannelHandlerContext.java:348)
>>>         at io.netty.channel.AbstractChannelHandlerContext.fireChannelRe
>>> ad(AbstractChannelHandlerContext.java:340)
>>>         at io.netty.channel.DefaultChannelPipeline$HeadContext.channelR
>>> ead(DefaultChannelPipeline.java:1414)
>>>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannel
>>> Read(AbstractChannelHandlerContext.java:362)
>>>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannel
>>> Read(AbstractChannelHandlerContext.java:348)
>>>         at io.netty.channel.DefaultChannelPipeline.fireChannelRead(Defa
>>> ultChannelPipeline.java:945)
>>>         at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStrea
>>> mUnsafe.epollInReady(AbstractEpollStreamChannel.java:800)
>>>         at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEven
>>> tLoop.java:404)
>>>         at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.jav
>>> a:304)
>>>         at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(Sin
>>> gleThreadEventExecutor.java:886)
>>>         at java.base/java.lang.Thread.run(Thread.java:844)
>>>
>>>
>>>
>>
>