git.net

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Help with bad errors on 4.6.1


Sent out a PR for the issues that I observed:

https://github.com/apache/bookkeeper/pull/1240

On Thu, Mar 8, 2018 at 10:47 PM, Sijie Guo <guosijie@xxxxxxxxx> wrote:

> So the problem here is:
>
> - a corrupted request failed the V3 request decoder, so bookie switched to
> use v2 request decoder. Once the switch happen, the bookie will always use
> v2 request decoder decoding v3 request. then all your v3 requests will be
> failing with unknown op and trigger the bytebuf double releasing issue.
>
> - a corrupted response failed the V3 response decoder, so client switched
> to use v3 response decoder. Once the switch happen, the client will always
> use v2 request decoder decoding v3 response. so all the responses will be
> failing with " Received unknown response : op code"
>
> Although I don't know how the first request/response happened (it can be
> any issue, even network corruption), the problem is once this happen,
> either client/bookie will be forcing to use v2 request/response decoder and
> never change. so the problem will remain until you restarted. And this is
> the behavior that Enrico is seeing.
>
> There are a couple of issues to address here:
>
> 1) we need to add a flag to disable falling back to use v2
> request/response coder and make sure it always use v3 protocol. In this
> way, we will guarantee the problem not persist in the channel level.
> 2) we need to throw exception on unknown op code at request decode :
> https://github.com/apache/bookkeeper/blob/master/
> bookkeeper-server/src/main/java/org/apache/bookkeeper/
> proto/BookieProtoEncoding.java#L195 . As what we did at response decoder
> : https://github.com/apache/bookkeeper/blob/master/
> bookkeeper-server/src/main/java/org/apache/bookkeeper/
> proto/BookieProtoEncoding.java#L304 in https://github.com/apache/
> bookkeeper/issues/198
>
>
> Details are listed as below:
>
> --
>
> Details:
>
> - The client side stacktrace clearly showed that it is using v2 decoder on
> decoding responses. That means client failed to parse v3 response and it
> falls back to use v2 decoder on decoding responses. Because it is a
> "corrupted" v3 response, so v2 decoder can't
> find a right op code.  Then it throws illegal state exception.
>
>
> *Caused by: java.lang.IllegalStateException: Received unknown response :
> op code = 6        at
> org.apache.bookkeeper.proto.BookieProtoEncoding$ResponseEnDeCoderPreV3.decode(BookieProtoEncoding.java:294)*
>
> - For the stacktrace at bookie side:
>
> 1. It is clear that BookieRequestHandler#L77 is called. That means the
> message is neither v2 request nor v3 request. It is a ByteBuf.
>
> https://github.com/apache/bookkeeper/blob/master/
> bookkeeper-server/src/main/java/org/apache/bookkeeper/
> proto/BookieRequestHandler.java#L77
>
> 2. V3 request decoder is using protobuf to decode bytes array. If it is
> not a valid v3 request, it will always throw exceptions. so the code
> mentioned above will never be triggered
>
> https://github.com/apache/bookkeeper/blob/master/
> bookkeeper-server/src/main/java/org/apache/bookkeeper/
> proto/BookieProtoEncoding.java#L344
>
> 3. The only case that BookieRequestHandler#L77 can happen is v2 request
> decoder doesn't parse the buffer. This leads to the bug in
>
> https://github.com/apache/bookkeeper/blob/master/
> bookkeeper-server/src/main/java/org/apache/bookkeeper/
> proto/BookieProtoEncoding.java#L194
>
>
> - How this can happen?
>
> If the client is using v3 protocol, but both client and bookie side
> stacktraces show it is using v2 protocol, that means there was a corruption
> causing client and bookie switching from v3 protocol to v2 protocol.
>
> https://github.com/apache/bookkeeper/blob/master/
> bookkeeper-server/src/main/java/org/apache/bookkeeper/
> proto/BookieProtoEncoding.java#L502
> https://github.com/apache/bookkeeper/blob/master/
> bookkeeper-server/src/main/java/org/apache/bookkeeper/
> proto/BookieProtoEncoding.java#L432
>
>
>
>
>
> On Thu, Mar 8, 2018 at 12:18 AM, Enrico Olivelli <eolivelli@xxxxxxxxx>
> wrote:
>
>> Hi all,
>> I am seeing this bad errors in some test environments with 4.6.1
>> The errors appear during rolling restart of the application, the test env
>> is made of 6 machines:
>> - 3 bookies
>> - 3 client machines (with several BK clients, of different sub-systems)
>> - running with 4.6.1 both client and servers
>>
>> *I do not have* reports of this errors from production, already running
>> 4.6.1 for the last month
>>
>> But the problem is quite scary
>>
>> This is a sample of relevant errors on clients (in this case Majordodo
>> brokers, with log level = INFO)
>>
>> logs/org.apache.bookkeeper.client.PendingAddOp:
>> 18-03-07-09-51-43       Write of ledger entry to quorum failed: L366634
>> E2557
>> 18-03-07-09-51-43       Write of ledger entry to quorum failed: L366634
>> E2558
>> 18-03-07-09-51-43       Write of ledger entry to quorum failed: L366634
>> E2559
>> 18-03-07-15-59-55       Failed to write entry (366680, 1865): Bookie
>> operation timeout
>> 18-03-07-15-59-55       Failed to write entry (366680, 1865): Bookie
>> operation timeout
>> 18-03-07-16-00-00       Failed to write entry (366680, 1865): Bookie
>> operation timeout
>> 18-03-07-16-00-00       Failed to write entry (366680, 1865): Bookie
>> operation timeout
>> 18-03-07-16-00-05       Failed to write entry (366680, 1865): Bookie
>> operation timeout
>>
>>
>> org.apache.bookkeeper.proto.PerChannelBookieClient:
>>
>> 18-03-07-10-06-44       Unexpected exception caught by bookie client
>> channel handler
>> 18-03-07-10-06-44       io.netty.handler.codec.DecoderException:
>> java.lang.IllegalStateException: Received unknown response : op code = 6
>> io.netty.handler.codec.DecoderException: java.lang.IllegalStateException:
>> Received unknown response : op code = 6
>>         at io.netty.handler.codec.MessageToMessageDecoder.channelRead(M
>> essageToMessageDecoder.java:98)
>>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannel
>> Read(AbstractChannelHandlerContext.java:362)
>>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannel
>> Read(AbstractChannelHandlerContext.java:348)
>>         at io.netty.channel.AbstractChannelHandlerContext.fireChannelRe
>> ad(AbstractChannelHandlerContext.java:340)
>>         at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(
>> ByteToMessageDecoder.java:310)
>>         at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(
>> ByteToMessageDecoder.java:297)
>>         at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteT
>> oMessageDecoder.java:413)
>>         at io.netty.handler.codec.ByteToMessageDecoder.channelRead(Byte
>> ToMessageDecoder.java:265)
>>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannel
>> Read(AbstractChannelHandlerContext.java:362)
>>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannel
>> Read(AbstractChannelHandlerContext.java:348)
>>         at io.netty.channel.AbstractChannelHandlerContext.fireChannelRe
>> ad(AbstractChannelHandlerContext.java:340)
>>         at io.netty.channel.DefaultChannelPipeline$HeadContext.
>> channelRead(DefaultChannelPipeline.java:1414)
>>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannel
>> Read(AbstractChannelHandlerContext.java:362)
>>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannel
>> Read(AbstractChannelHandlerContext.java:348)
>>         at io.netty.channel.DefaultChannelPipeline.fireChannelRead(Defa
>> ultChannelPipeline.java:945)
>>         at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStrea
>> mUnsafe.epollInReady(AbstractEpollStreamChannel.java:800)
>>         at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEven
>> tLoop.java:404)
>>         at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.
>> java:304)
>>         at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(
>> SingleThreadEventExecutor.java:886)
>>         at java.base/java.lang.Thread.run(Thread.java:844)
>>
>
>
>
>
>
>> Caused by: java.lang.IllegalStateException: Received unknown response :
>> op code = 6
>>         at org.apache.bookkeeper.proto.BookieProtoEncoding$ResponseEnDe
>> CoderPreV3.decode(BookieProtoEncoding.java:294)
>>         at org.apache.bookkeeper.proto.BookieProtoEncoding$ResponseDeco
>> der.decode(BookieProtoEncoding.java:478)
>>         at io.netty.handler.codec.MessageToMessageDecoder.channelRead(M
>> essageToMessageDecoder.java:88)
>>
>>
>>         ... 19 more
>>
>
>
>
>
>
>
>
>>
>>
>> Errors on Bookie side, with log level = INFO)
>>
>>
>> io.netty.channel.DefaultChannelPipeline:
>> 8-03-07-09-51-38       An exceptionCaught() event was fired, and it
>> reached at the tail of the pipeline. It usually means the last handler in
>> the pipeline did not handle the exception.
>> 18-03-07-09-51-38       io.netty.util.IllegalReferenceCountException:
>> refCnt: 0, increment: 1
>> io.netty.util.IllegalReferenceCountException: refCnt: 0, increment: 1
>>         at io.netty.buffer.AbstractReferenceCountedByteBuf.release0(Abs
>> tractReferenceCountedByteBuf.java:100)
>>         at io.netty.buffer.AbstractReferenceCountedByteBuf.release(Abst
>> ractReferenceCountedByteBuf.java:84)
>>         at io.netty.util.ReferenceCountUtil.release(ReferenceCountUtil.
>> java:88)
>>         at io.netty.channel.DefaultChannelPipeline.onUnhandledInboundMe
>> ssage(DefaultChannelPipeline.java:1202)
>>         at io.netty.channel.DefaultChannelPipeline$TailContext.
>> channelRead(DefaultChannelPipeline.java:1299)
>>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannel
>> Read(AbstractChannelHandlerContext.java:362)
>>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannel
>> Read(AbstractChannelHandlerContext.java:348)
>>         at io.netty.channel.AbstractChannelHandlerContext.fireChannelRe
>> ad(AbstractChannelHandlerContext.java:340)
>>         at io.netty.channel.ChannelInboundHandlerAdapter.channelRead(Ch
>> annelInboundHandlerAdapter.java:86)
>>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannel
>> Read(AbstractChannelHandlerContext.java:362)
>>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannel
>> Read(AbstractChannelHandlerContext.java:348)
>>         at io.netty.channel.AbstractChannelHandlerContext.fireChannelRe
>> ad(AbstractChannelHandlerContext.java:340)
>>         at org.apache.bookkeeper.proto.BookieRequestHandler.channelRead
>> (BookieRequestHandler.java:78)
>>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannel
>> Read(AbstractChannelHandlerContext.java:362)
>>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannel
>> Read(AbstractChannelHandlerContext.java:348)
>>         at io.netty.channel.AbstractChannelHandlerContext.fireChannelRe
>> ad(AbstractChannelHandlerContext.java:340)
>>         at io.netty.channel.ChannelInboundHandlerAdapter.channelRead(Ch
>> annelInboundHandlerAdapter.java:86)
>>         at org.apache.bookkeeper.proto.AuthHandler$ServerSideHandler.ch
>> annelRead(AuthHandler.java:88)
>>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannel
>> Read(AbstractChannelHandlerContext.java:362)
>>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannel
>> Read(AbstractChannelHandlerContext.java:348)
>>         at io.netty.channel.AbstractChannelHandlerContext.fireChannelRe
>> ad(AbstractChannelHandlerContext.java:340)
>>         at io.netty.handler.codec.MessageToMessageDecoder.channelRead(M
>> essageToMessageDecoder.java:102)
>>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannel
>> Read(AbstractChannelHandlerContext.java:362)
>>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannel
>> Read(AbstractChannelHandlerContext.java:348)
>>         at io.netty.channel.AbstractChannelHandlerContext.fireChannelRe
>> ad(AbstractChannelHandlerContext.java:340)
>>         at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(
>> ByteToMessageDecoder.java:310)
>>         at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(
>> ByteToMessageDecoder.java:297)
>>         at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteT
>> oMessageDecoder.java:413)
>>         at io.netty.handler.codec.ByteToMessageDecoder.channelRead(Byte
>> ToMessageDecoder.java:265)
>>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannel
>> Read(AbstractChannelHandlerContext.java:362)
>>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannel
>> Read(AbstractChannelHandlerContext.java:348)
>>         at io.netty.channel.AbstractChannelHandlerContext.fireChannelRe
>> ad(AbstractChannelHandlerContext.java:340)
>>         at io.netty.channel.DefaultChannelPipeline$HeadContext.
>> channelRead(DefaultChannelPipeline.java:1414)
>>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannel
>> Read(AbstractChannelHandlerContext.java:362)
>>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannel
>> Read(AbstractChannelHandlerContext.java:348)
>>         at io.netty.channel.DefaultChannelPipeline.fireChannelRead(Defa
>> ultChannelPipeline.java:945)
>>         at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStrea
>> mUnsafe.epollInReady(AbstractEpollStreamChannel.java:800)
>>         at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEven
>> tLoop.java:404)
>>         at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.
>> java:304)
>>         at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(
>> SingleThreadEventExecutor.java:886)
>>         at java.base/java.lang.Thread.run(Thread.java:844)
>>
>>
>>
>