git.net

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Help with bad errors on 4.6.1


Thanks Enrico!

On Mon, Mar 12, 2018 at 4:21 AM, Enrico Olivelli <eolivelli@xxxxxxxxx>
wrote:

> Summary of my findings:
>
> The problem is about clients which get messed up and are not able to read
> and write to bookies after rolling restarts of an application,
> the problem appears only on a cluster of 6 machines (reduced to 3 in order
> to narrow down the search) of my colleagues which are performing "manual"
> tests on a new version of an application, no changes at BK client level are
> present in the application
>

Does the problem only exist in that cluster? Or have you tried other
machines with same setup?

As what you said, you didn't see this error in your production. I am
wondering the corruption is coming from your hardware/os.


- Sijie



>
> I am able to reproduce the problem quite simply on the those machines, it
> is enough to restart bookies to mess the cluster, this is my reproducer:
> 1) there is a client which is continuously writing to BK, on a WAL (same
> pattern as in the BK tutorial)
> 2) there are two other clients (other machines) which are 'tailing' the WAL
> (as in the tutorial), these are the 'readers' which have the problem
> 3) there are 3 bookies, co-located in the same machines as the clients
> 4) running BK 4.6.1  on both clients and servers, using v3 protocol
> 5) ledgers are written with WQ=2 and AQ=2 (over a cluster of 3)
> 6) restart one or two bookies and the readers start being prevented to
> read: readers start receiving invalid v2 responses (com.google.protobuf.
> InvalidProtocolBufferException: Protocol message contained an invalid tag
> (zero))  [before Sijie's patch this was "Invalid opcode..."]
> 7) on bookie side we have the error below,
> "com.google.protobuf.InvalidProtocolBufferException
> $InvalidWireTypeException:
> Protocol message tag had invalid wire type",
> BookieProtoEncoding#RequestDecoder falls back to v2 protocol and start
> answering with wrong encoding
>
> I am not able to create a reproducer on BK test suite.
>
> Notes:
> - when I "restart" bookies I issue a kill -9 (I think this could be the
> reason why I can't reproduce the issue on testcases)
> - the issue starts when bookie comes up again
> - when a reader is promoted to leader (writer) it is no more able to write
> entries, all writes time out
> - I am running on Linux with Epool, no Auth, no TLS
> - Ivan took a look and some network dumps and found corrupted messages
> (thank you Ivan !!)
>
> Beside Additional (not important) notes from the history of this email
> thread:
> - I have never seen this kind of errors in production, and we have been
> running 4.6.1 in production for a month
> - I am running with jdk9 but even reverting to jdk8 the problem is the same
> - I was running with Netty 4.1.21Final but I have reverted to 4.1.12Final
> and the problem is the same
>
> I think it may be an issue on PerChannelBookieClient, but to me it is not
> very clear how "connection resets" are handled, I am digging into code.
>
> Bolow there is the stacktrace or the error on bookie side
>
> (line numbers may not match 4.6.1 tag because I have added plenty of debug,
> but we are at the point when BookieProtoEncoding#RequestDecoder switches
> to
> v2 protocol)
>
> Thank you very much to all of you that is helping with this problem
>
> Enrico
>
>
> 18-03-12-11-08-34    error decoding msg PooledSlicedByteBuf(ridx: 145,
> widx: 145, cap: 145/145, unwrapped: PooledUnsafeDirectByteBuf(ridx: 149,
> widx: 298, cap: 65536))
> 18-03-12-11-08-34
> com.google.protobuf.InvalidProtocolBufferException
> $InvalidWireTypeException:
> Protocol message tag had invalid wire type.
> com.google.protobuf.InvalidProtocolBufferException
> $InvalidWireTypeException:
> Protocol message tag had invalid wire type.
>     at
> com.google.protobuf.InvalidProtocolBufferException.invalidWireType(
> InvalidProtocolBufferException.java:115)
>     at
> com.google.protobuf.UnknownFieldSet$Builder.mergeFieldFrom(
> UnknownFieldSet.java:551)
>     at
> com.google.protobuf.GeneratedMessageV3.parseUnknownField(
> GeneratedMessageV3.java:293)
>     at
> org.apache.bookkeeper.proto.BookkeeperProtocol$AddRequest.
> <init>(BookkeeperProtocol.java:4536)
>     at
> org.apache.bookkeeper.proto.BookkeeperProtocol$AddRequest.
> <init>(BookkeeperProtocol.java:4497)
>     at
> org.apache.bookkeeper.proto.BookkeeperProtocol$AddRequest$
> 1.parsePartialFrom(BookkeeperProtocol.java:5378)
>     at
> org.apache.bookkeeper.proto.BookkeeperProtocol$AddRequest$
> 1.parsePartialFrom(BookkeeperProtocol.java:5373)
>     at
> com.google.protobuf.CodedInputStream$StreamDecoder.readMessage(
> CodedInputStream.java:2362)
>     at
> org.apache.bookkeeper.proto.BookkeeperProtocol$Request.<
> init>(BookkeeperProtocol.java:1367)
>     at
> org.apache.bookkeeper.proto.BookkeeperProtocol$Request.<
> init>(BookkeeperProtocol.java:1296)
>     at
> org.apache.bookkeeper.proto.BookkeeperProtocol$Request$1.parsePartialFrom(
> BookkeeperProtocol.java:3265)
>     at
> org.apache.bookkeeper.proto.BookkeeperProtocol$Request$1.parsePartialFrom(
> BookkeeperProtocol.java:3260)
>     at
> com.google.protobuf.AbstractParser.parsePartialFrom(
> AbstractParser.java:221)
>     at com.google.protobuf.AbstractParser.parseFrom(
> AbstractParser.java:239)
>     at com.google.protobuf.AbstractParser.parseFrom(
> AbstractParser.java:49)
>     at
> com.google.protobuf.GeneratedMessageV3.parseWithIOException(
> GeneratedMessageV3.java:320)
>     at
> org.apache.bookkeeper.proto.BookkeeperProtocol$Request.
> parseFrom(BookkeeperProtocol.java:1905)
>     at
> org.apache.bookkeeper.proto.BookieProtoEncoding$RequestEnDecoderV3.decode(
> BookieProtoEncoding.java:309)
>     at
> org.apache.bookkeeper.proto.BookieProtoEncoding$RequestDecoder.decode(
> BookieProtoEncoding.java:408)
>     at
> io.netty.handler.codec.MessageToMessageDecoder.channelRead(
> MessageToMessageDecoder.java:88)
>     at
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(
> AbstractChannelHandlerContext.java:362)
>     at
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(
> AbstractChannelHandlerContext.java:348)
>     at
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(
> AbstractChannelHandlerContext.java:340)
>     at
> io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(
> ByteToMessageDecoder.java:310)
>     at
> io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(
> ByteToMessageDecoder.java:297)
>     at
> io.netty.handler.codec.ByteToMessageDecoder.callDecode(
> ByteToMessageDecoder.java:413)
>     at
> io.netty.handler.codec.ByteToMessageDecoder.channelRead(
> ByteToMessageDecoder.java:265)
>     at
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(
> AbstractChannelHandlerContext.java:362)
>     at
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(
> AbstractChannelHandlerContext.java:348)
>     at
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(
> AbstractChannelHandlerContext.java:340)
>     at
> io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(
> DefaultChannelPipeline.java:1334)
>     at
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(
> AbstractChannelHandlerContext.java:362)
>     at
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(
> AbstractChannelHandlerContext.java:348)
>     at
> io.netty.channel.DefaultChannelPipeline.fireChannelRead(
> DefaultChannelPipeline.java:926)
>     at
> io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.
> epollInReady(AbstractEpollStreamChannel.java:979)
>     at
> io.netty.channel.epoll.EpollEventLoop.processReady(
> EpollEventLoop.java:401)
>     at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:306)
>     at
> io.netty.util.concurrent.SingleThreadEventExecutor$5.
> run(SingleThreadEventExecutor.java:858)
>     at java.base/java.lang.Thread.run(Thread.java:844)
>
>
>
>
> 2018-03-12 11:17 GMT+01:00 Enrico Olivelli <eolivelli@xxxxxxxxx>:
>
> > I will send a report soon.
> > With new debug I have some finding, I am looking into problems during
> > restarts of bookies. Maybe there is some problem in error handling in
> PCBC.
> >
> > Thank you
> > Enrico
> >
> > 2018-03-12 10:58 GMT+01:00 Ivan Kelly <ivank@xxxxxxxxxx>:
> >
> >> Enrico, could you summarize what the state of things is now? What are
> >> you running, what problems are you seeing and how are the problems
> >> manifesting themselves.
> >>
> >> Regards,
> >> Ivan
> >>
> >> On Mon, Mar 12, 2018 at 10:15 AM, Enrico Olivelli <eolivelli@xxxxxxxxx>
> >> wrote:
> >> > Applyed Sijie's fixes and added some debug:
> >> >
> >> > Problem is triggered when you restart a bookie (I have a cluster of 3
> >> > bookies, WQ = 2 and AQ = 2)
> >> >
> >> > Below a new error on client side ("tailing" reader)
> >> >
> >> > Enrico
> >> >
> >> > this is a new error on client side:
> >> > 18-03-12-09-11-45    Unexpected exception caught by bookie client
> >> channel
> >> > handler
> >> > 18-03-12-09-11-45    io.netty.handler.codec.DecoderException:
> >> > com.google.protobuf.InvalidProtocolBufferException: Protocol message
> >> > contained an invalid tag (zero).
> >> > io.netty.handler.codec.DecoderException:
> >> > com.google.protobuf.InvalidProtocolBufferException: Protocol message
> >> > contained an invalid tag (zero).
> >> >     at
> >> > io.netty.handler.codec.MessageToMessageDecoder.channelRead(M
> >> essageToMessageDecoder.java:98)
> >> >     at
> >> > io.netty.channel.AbstractChannelHandlerContext.invokeChannel
> >> Read(AbstractChannelHandlerContext.java:362)
> >> >     at
> >> > io.netty.channel.AbstractChannelHandlerContext.invokeChannel
> >> Read(AbstractChannelHandlerContext.java:348)
> >> >     at
> >> > io.netty.channel.AbstractChannelHandlerContext.fireChannelRe
> >> ad(AbstractChannelHandlerContext.java:340)
> >> >     at
> >> > io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(
> >> ByteToMessageDecoder.java:310)
> >> >     at
> >> > io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(
> >> ByteToMessageDecoder.java:297)
> >> >     at
> >> > io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteT
> >> oMessageDecoder.java:413)
> >> >     at
> >> > io.netty.handler.codec.ByteToMessageDecoder.channelRead(Byte
> >> ToMessageDecoder.java:265)
> >> >     at
> >> > io.netty.channel.AbstractChannelHandlerContext.invokeChannel
> >> Read(AbstractChannelHandlerContext.java:362)
> >> >     at
> >> > io.netty.channel.AbstractChannelHandlerContext.invokeChannel
> >> Read(AbstractChannelHandlerContext.java:348)
> >> >     at
> >> > io.netty.channel.AbstractChannelHandlerContext.fireChannelRe
> >> ad(AbstractChannelHandlerContext.java:340)
> >> >     at
> >> > io.netty.channel.DefaultChannelPipeline$HeadContext.
> >> channelRead(DefaultChannelPipeline.java:1334)
> >> >     at
> >> > io.netty.channel.AbstractChannelHandlerContext.invokeChannel
> >> Read(AbstractChannelHandlerContext.java:362)
> >> >     at
> >> > io.netty.channel.AbstractChannelHandlerContext.invokeChannel
> >> Read(AbstractChannelHandlerContext.java:348)
> >> >     at
> >> > io.netty.channel.DefaultChannelPipeline.fireChannelRead(Defa
> >> ultChannelPipeline.java:926)
> >> >     at
> >> > io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStrea
> >> mUnsafe.epollInReady(AbstractEpollStreamChannel.java:979)
> >> >     at
> >> > io.netty.channel.epoll.EpollEventLoop.processReady(EpollEven
> >> tLoop.java:401)
> >> >     at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.
> >> java:306)
> >> >     at
> >> > io.netty.util.concurrent.SingleThreadEventExecutor$5.run(
> >> SingleThreadEventExecutor.java:858)
> >> >     at java.base/java.lang.Thread.run(Thread.java:844)
> >> > Caused by: com.google.protobuf.InvalidProtocolBufferException:
> Protocol
> >> > message contained an invalid tag (zero).
> >> >     at
> >> > com.google.protobuf.InvalidProtocolBufferException.
> >> invalidTag(InvalidProtocolBufferException.java:105)
> >> >     at
> >> > com.google.protobuf.CodedInputStream$StreamDecoder.readTag(C
> >> odedInputStream.java:2060)
> >> >     at
> >> > org.apache.bookkeeper.proto.BookkeeperProtocol$ReadResponse.
> >> <init>(BookkeeperProtocol.java:10009)
> >> >     at
> >> > org.apache.bookkeeper.proto.BookkeeperProtocol$ReadResponse.
> >> <init>(BookkeeperProtocol.java:9975)
> >> >     at
> >> > org.apache.bookkeeper.proto.BookkeeperProtocol$ReadResponse$
> >> 1.parsePartialFrom(BookkeeperProtocol.java:10869)
> >> >     at
> >> > org.apache.bookkeeper.proto.BookkeeperProtocol$ReadResponse$
> >> 1.parsePartialFrom(BookkeeperProtocol.java:10864)
> >> >     at
> >> > com.google.protobuf.CodedInputStream$StreamDecoder.
> >> readMessage(CodedInputStream.java:2362)
> >> >     at
> >> > org.apache.bookkeeper.proto.BookkeeperProtocol$Response.<ini
> >> t>(BookkeeperProtocol.java:7852)
> >> >     at
> >> > org.apache.bookkeeper.proto.BookkeeperProtocol$Response.<ini
> >> t>(BookkeeperProtocol.java:7782)
> >> >     at
> >> > org.apache.bookkeeper.proto.BookkeeperProtocol$Response$1.pa
> >> rsePartialFrom(BookkeeperProtocol.java:9887)
> >> >     at
> >> > org.apache.bookkeeper.proto.BookkeeperProtocol$Response$1.pa
> >> rsePartialFrom(BookkeeperProtocol.java:9882)
> >> >     at
> >> > com.google.protobuf.AbstractParser.parsePartialFrom(Abstract
> >> Parser.java:221)
> >> >     at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.
> >> java:239)
> >> >     at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.
> >> java:49)
> >> >     at
> >> > com.google.protobuf.GeneratedMessageV3.parseWithIOException(
> >> GeneratedMessageV3.java:320)
> >> >     at
> >> > org.apache.bookkeeper.proto.BookkeeperProtocol$Response.pars
> >> eFrom(BookkeeperProtocol.java:8454)
> >> >     at
> >> > org.apache.bookkeeper.proto.BookieProtoEncoding$ResponseEnDe
> >> coderV3.decode(BookieProtoEncoding.java:329)
> >> >     at
> >> > org.apache.bookkeeper.proto.BookieProtoEncoding$ResponseDeco
> >> der.decode(BookieProtoEncoding.java:470)
> >> >     at
> >> > io.netty.handler.codec.MessageToMessageDecoder.channelRead(M
> >> essageToMessageDecoder.java:88)
> >> >     ... 19 more
> >> >
> >> >
> >> > 2018-03-09 22:37 GMT+01:00 Sijie Guo <guosijie@xxxxxxxxx>:
> >> >
> >> >> Enrico,
> >> >>
> >> >> I would suggest you applied my fixes and then debug from there. In
> this
> >> >> way, you will have a better sense where the first corruption is from.
> >> >>
> >> >> Sijie
> >> >>
> >> >> On Fri, Mar 9, 2018 at 11:48 AM Enrico Olivelli <eolivelli@xxxxxxxxx
> >
> >> >> wrote:
> >> >>
> >> >> > Il ven 9 mar 2018, 19:30 Enrico Olivelli <eolivelli@xxxxxxxxx> ha
> >> >> scritto:
> >> >> >
> >> >> > > Thank you Ivan!
> >> >> > > I hope I did not mess up the dump and added ZK ports.  We are not
> >> using
> >> >> > > standard ports and in that 3  machines there is also the 3 nodes
> zk
> >> >> > > ensemble which is supporting BK and all the other parts of the
> >> >> > application
> >> >> > >
> >> >> > > So one explanation would be that something is connecting to the
> >> bookie
> >> >> > and
> >> >> > > this makes the bookie switch in a corrupted state by double
> >> releasing a
> >> >> > > bytebuf?
> >> >> > >
> >> >> >
> >> >> > I did some experiments and it is easy to reproduce the bookie side
> >> error
> >> >> > and the double release with a forged sequence of bytes (just using
> nc
> >> >> from
> >> >> > the shell)
> >> >> > But this seems not enough to break the bookie.
> >> >> > I guess there is some corruption on client side and the error on
> the
> >> >> bookie
> >> >> > is only and effect, as Ivan is saying.
> >> >> > My colleagues left the system running  with a deep level of debug
> >> during
> >> >> > next weekend, hopefully we will get some other stacktrace
> >> >> >
> >> >> > Enrico
> >> >> >
> >> >> >
> >> >> > > Enrico
> >> >> > >
> >> >> > >
> >> >> > > Il ven 9 mar 2018, 18:23 Ivan Kelly <ivank@xxxxxxxxxx> ha
> scritto:
> >> >> > >
> >> >> > >> I need to sign off for the day. I've done some analysis of a
> >> tcpdump
> >> >> > >> enrico sent to me out of band (may contain sensitive info so
> best
> >> not
> >> >> > >> to post on public forum).
> >> >> > >>
> >> >> > >> I've attached a dump of just first bit of the header. Format is
> >> >> > >> <sequence in dump> <whether a request or response>(<remote
> port>)
> >> >> > >> <hexdump of payload>
> >> >> > >>
> >> >> > >> There are definitely corrupt packets coming from somewhere.
> >> Search for
> >> >> > >> lines with CORRUPT.
> >> >> > >>
> >> >> > >> <snip>
> >> >> > >> 0247 -  req (049546) - 00:00:00:08:ff:ff:ff:fe:00:00:00:0b
> >> CORRUPT
> >> >> > >> </snip>
> >> >> > >>
> >> >> > >> It's not clear whether these are originating at a valid client
> or
> >> not.
> >> >> > >> These trigger corrupt responses from the server, which I guess
> is
> >> the
> >> >> > >> double free manifesting itself. Strangely the
> >> >> > >> corrupt message seems to have a lot of data in common with what
> >> seems
> >> >> > >> like an ok message (it's clearer on fixed width font).
> >> >> > >>
> >> >> > >> <snip>
> >> >> > >> 0248 -  resp(049720) -
> >> >> > >>
> >> >> > >>
> >> >> > 00:00:00:54:00:03:00:89:00:00:02:86:00:07:e2:b1:00:00:00:00:
> >> >> 00:00:02:86:00:05:e9:76:00:00
> >> >> > >> 0249 -  resp(049546) -
> >> >> > >> 00:00:00:10:ff:ff:ff:fe:00:00:02:86:00:07:e2:b1:00:00:00:00
> >> >> CORRUPT
> >> >> > >> </snip>
> >> >> > >>
> >> >> > >> There's also some other weird traffic. Correct BK protobuf
> traffic
> >> >> > >> should be <4 bytes len>:00:03:....
> >> >> > >> There seems to be other traffic which is being accepted at the
> >> same
> >> >> > >> port, but looks like ZK traffic.
> >> >> > >>
> >> >> > >> Anyhow, I'll dig more on monday.
> >> >> > >>
> >> >> > >> -Ivan
> >> >> > >>
> >> >> > >>
> >> >> > >> On Fri, Mar 9, 2018 at 3:27 PM, Ivan Kelly <ivank@xxxxxxxxxx>
> >> wrote:
> >> >> > >> > On Fri, Mar 9, 2018 at 3:20 PM, Enrico Olivelli <
> >> >> eolivelli@xxxxxxxxx>
> >> >> > >> wrote:
> >> >> > >> >> Bookies
> >> >> > >> >> 10.168.10.117:1822 -> bad bookie with 4.1.21
> >> >> > >> >> 10.168.10.116:1822 -> bookie with 4.1.12
> >> >> > >> >> 10.168.10.118:1281 -> bookie with 4.1.12
> >> >> > >> >>
> >> >> > >> >> 10.168.10.117 client machine on which I have 4.1.21 client
> >> >> (different
> >> >> > >> >> process than the bookie one)
> >> >> > >> > Oh. This dump won't have the stream we need then, as that will
> >> be on
> >> >> > >> > loopback. Try adding "-i any" to the tcpdump. Sorry, I didn't
> >> >> realize
> >> >> > >> > your clients and servers are colocated.
> >> >> > >> >
> >> >> > >> > -Ivan
> >> >> > >>
> >> >> > > --
> >> >> > >
> >> >> > >
> >> >> > > -- Enrico Olivelli
> >> >> > >
> >> >> > --
> >> >> >
> >> >> >
> >> >> > -- Enrico Olivelli
> >> >> >
> >> >>
> >>
> >
> >
>