git.net

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Help with bad errors on 4.6.1


Enrico, could you summarize what the state of things is now? What are
you running, what problems are you seeing and how are the problems
manifesting themselves.

Regards,
Ivan

On Mon, Mar 12, 2018 at 10:15 AM, Enrico Olivelli <eolivelli@xxxxxxxxx> wrote:
> Applyed Sijie's fixes and added some debug:
>
> Problem is triggered when you restart a bookie (I have a cluster of 3
> bookies, WQ = 2 and AQ = 2)
>
> Below a new error on client side ("tailing" reader)
>
> Enrico
>
> this is a new error on client side:
> 18-03-12-09-11-45    Unexpected exception caught by bookie client channel
> handler
> 18-03-12-09-11-45    io.netty.handler.codec.DecoderException:
> com.google.protobuf.InvalidProtocolBufferException: Protocol message
> contained an invalid tag (zero).
> io.netty.handler.codec.DecoderException:
> com.google.protobuf.InvalidProtocolBufferException: Protocol message
> contained an invalid tag (zero).
>     at
> io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:98)
>     at
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
>     at
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
>     at
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
>     at
> io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:310)
>     at
> io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:297)
>     at
> io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:413)
>     at
> io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:265)
>     at
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
>     at
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
>     at
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
>     at
> io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1334)
>     at
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
>     at
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
>     at
> io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:926)
>     at
> io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:979)
>     at
> io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:401)
>     at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:306)
>     at
> io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
>     at java.base/java.lang.Thread.run(Thread.java:844)
> Caused by: com.google.protobuf.InvalidProtocolBufferException: Protocol
> message contained an invalid tag (zero).
>     at
> com.google.protobuf.InvalidProtocolBufferException.invalidTag(InvalidProtocolBufferException.java:105)
>     at
> com.google.protobuf.CodedInputStream$StreamDecoder.readTag(CodedInputStream.java:2060)
>     at
> org.apache.bookkeeper.proto.BookkeeperProtocol$ReadResponse.<init>(BookkeeperProtocol.java:10009)
>     at
> org.apache.bookkeeper.proto.BookkeeperProtocol$ReadResponse.<init>(BookkeeperProtocol.java:9975)
>     at
> org.apache.bookkeeper.proto.BookkeeperProtocol$ReadResponse$1.parsePartialFrom(BookkeeperProtocol.java:10869)
>     at
> org.apache.bookkeeper.proto.BookkeeperProtocol$ReadResponse$1.parsePartialFrom(BookkeeperProtocol.java:10864)
>     at
> com.google.protobuf.CodedInputStream$StreamDecoder.readMessage(CodedInputStream.java:2362)
>     at
> org.apache.bookkeeper.proto.BookkeeperProtocol$Response.<init>(BookkeeperProtocol.java:7852)
>     at
> org.apache.bookkeeper.proto.BookkeeperProtocol$Response.<init>(BookkeeperProtocol.java:7782)
>     at
> org.apache.bookkeeper.proto.BookkeeperProtocol$Response$1.parsePartialFrom(BookkeeperProtocol.java:9887)
>     at
> org.apache.bookkeeper.proto.BookkeeperProtocol$Response$1.parsePartialFrom(BookkeeperProtocol.java:9882)
>     at
> com.google.protobuf.AbstractParser.parsePartialFrom(AbstractParser.java:221)
>     at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:239)
>     at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:49)
>     at
> com.google.protobuf.GeneratedMessageV3.parseWithIOException(GeneratedMessageV3.java:320)
>     at
> org.apache.bookkeeper.proto.BookkeeperProtocol$Response.parseFrom(BookkeeperProtocol.java:8454)
>     at
> org.apache.bookkeeper.proto.BookieProtoEncoding$ResponseEnDecoderV3.decode(BookieProtoEncoding.java:329)
>     at
> org.apache.bookkeeper.proto.BookieProtoEncoding$ResponseDecoder.decode(BookieProtoEncoding.java:470)
>     at
> io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:88)
>     ... 19 more
>
>
> 2018-03-09 22:37 GMT+01:00 Sijie Guo <guosijie@xxxxxxxxx>:
>
>> Enrico,
>>
>> I would suggest you applied my fixes and then debug from there. In this
>> way, you will have a better sense where the first corruption is from.
>>
>> Sijie
>>
>> On Fri, Mar 9, 2018 at 11:48 AM Enrico Olivelli <eolivelli@xxxxxxxxx>
>> wrote:
>>
>> > Il ven 9 mar 2018, 19:30 Enrico Olivelli <eolivelli@xxxxxxxxx> ha
>> scritto:
>> >
>> > > Thank you Ivan!
>> > > I hope I did not mess up the dump and added ZK ports.  We are not using
>> > > standard ports and in that 3  machines there is also the 3 nodes zk
>> > > ensemble which is supporting BK and all the other parts of the
>> > application
>> > >
>> > > So one explanation would be that something is connecting to the bookie
>> > and
>> > > this makes the bookie switch in a corrupted state by double releasing a
>> > > bytebuf?
>> > >
>> >
>> > I did some experiments and it is easy to reproduce the bookie side error
>> > and the double release with a forged sequence of bytes (just using nc
>> from
>> > the shell)
>> > But this seems not enough to break the bookie.
>> > I guess there is some corruption on client side and the error on the
>> bookie
>> > is only and effect, as Ivan is saying.
>> > My colleagues left the system running  with a deep level of debug during
>> > next weekend, hopefully we will get some other stacktrace
>> >
>> > Enrico
>> >
>> >
>> > > Enrico
>> > >
>> > >
>> > > Il ven 9 mar 2018, 18:23 Ivan Kelly <ivank@xxxxxxxxxx> ha scritto:
>> > >
>> > >> I need to sign off for the day. I've done some analysis of a tcpdump
>> > >> enrico sent to me out of band (may contain sensitive info so best not
>> > >> to post on public forum).
>> > >>
>> > >> I've attached a dump of just first bit of the header. Format is
>> > >> <sequence in dump> <whether a request or response>(<remote port>)
>> > >> <hexdump of payload>
>> > >>
>> > >> There are definitely corrupt packets coming from somewhere. Search for
>> > >> lines with CORRUPT.
>> > >>
>> > >> <snip>
>> > >> 0247 -  req (049546) - 00:00:00:08:ff:ff:ff:fe:00:00:00:0b    CORRUPT
>> > >> </snip>
>> > >>
>> > >> It's not clear whether these are originating at a valid client or not.
>> > >> These trigger corrupt responses from the server, which I guess is the
>> > >> double free manifesting itself. Strangely the
>> > >> corrupt message seems to have a lot of data in common with what seems
>> > >> like an ok message (it's clearer on fixed width font).
>> > >>
>> > >> <snip>
>> > >> 0248 -  resp(049720) -
>> > >>
>> > >>
>> > 00:00:00:54:00:03:00:89:00:00:02:86:00:07:e2:b1:00:00:00:00:
>> 00:00:02:86:00:05:e9:76:00:00
>> > >> 0249 -  resp(049546) -
>> > >> 00:00:00:10:ff:ff:ff:fe:00:00:02:86:00:07:e2:b1:00:00:00:00
>> CORRUPT
>> > >> </snip>
>> > >>
>> > >> There's also some other weird traffic. Correct BK protobuf traffic
>> > >> should be <4 bytes len>:00:03:....
>> > >> There seems to be other traffic which is being accepted at the same
>> > >> port, but looks like ZK traffic.
>> > >>
>> > >> Anyhow, I'll dig more on monday.
>> > >>
>> > >> -Ivan
>> > >>
>> > >>
>> > >> On Fri, Mar 9, 2018 at 3:27 PM, Ivan Kelly <ivank@xxxxxxxxxx> wrote:
>> > >> > On Fri, Mar 9, 2018 at 3:20 PM, Enrico Olivelli <
>> eolivelli@xxxxxxxxx>
>> > >> wrote:
>> > >> >> Bookies
>> > >> >> 10.168.10.117:1822 -> bad bookie with 4.1.21
>> > >> >> 10.168.10.116:1822 -> bookie with 4.1.12
>> > >> >> 10.168.10.118:1281 -> bookie with 4.1.12
>> > >> >>
>> > >> >> 10.168.10.117 client machine on which I have 4.1.21 client
>> (different
>> > >> >> process than the bookie one)
>> > >> > Oh. This dump won't have the stream we need then, as that will be on
>> > >> > loopback. Try adding "-i any" to the tcpdump. Sorry, I didn't
>> realize
>> > >> > your clients and servers are colocated.
>> > >> >
>> > >> > -Ivan
>> > >>
>> > > --
>> > >
>> > >
>> > > -- Enrico Olivelli
>> > >
>> > --
>> >
>> >
>> > -- Enrico Olivelli
>> >
>>