git.net

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Help with bad errors on 4.6.1


I will send a report soon.
With new debug I have some finding, I am looking into problems during
restarts of bookies. Maybe there is some problem in error handling in PCBC.

Thank you
Enrico

2018-03-12 10:58 GMT+01:00 Ivan Kelly <ivank@xxxxxxxxxx>:

> Enrico, could you summarize what the state of things is now? What are
> you running, what problems are you seeing and how are the problems
> manifesting themselves.
>
> Regards,
> Ivan
>
> On Mon, Mar 12, 2018 at 10:15 AM, Enrico Olivelli <eolivelli@xxxxxxxxx>
> wrote:
> > Applyed Sijie's fixes and added some debug:
> >
> > Problem is triggered when you restart a bookie (I have a cluster of 3
> > bookies, WQ = 2 and AQ = 2)
> >
> > Below a new error on client side ("tailing" reader)
> >
> > Enrico
> >
> > this is a new error on client side:
> > 18-03-12-09-11-45    Unexpected exception caught by bookie client channel
> > handler
> > 18-03-12-09-11-45    io.netty.handler.codec.DecoderException:
> > com.google.protobuf.InvalidProtocolBufferException: Protocol message
> > contained an invalid tag (zero).
> > io.netty.handler.codec.DecoderException:
> > com.google.protobuf.InvalidProtocolBufferException: Protocol message
> > contained an invalid tag (zero).
> >     at
> > io.netty.handler.codec.MessageToMessageDecoder.channelRead(
> MessageToMessageDecoder.java:98)
> >     at
> > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(
> AbstractChannelHandlerContext.java:362)
> >     at
> > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(
> AbstractChannelHandlerContext.java:348)
> >     at
> > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(
> AbstractChannelHandlerContext.java:340)
> >     at
> > io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(
> ByteToMessageDecoder.java:310)
> >     at
> > io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(
> ByteToMessageDecoder.java:297)
> >     at
> > io.netty.handler.codec.ByteToMessageDecoder.callDecode(
> ByteToMessageDecoder.java:413)
> >     at
> > io.netty.handler.codec.ByteToMessageDecoder.channelRead(
> ByteToMessageDecoder.java:265)
> >     at
> > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(
> AbstractChannelHandlerContext.java:362)
> >     at
> > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(
> AbstractChannelHandlerContext.java:348)
> >     at
> > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(
> AbstractChannelHandlerContext.java:340)
> >     at
> > io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(
> DefaultChannelPipeline.java:1334)
> >     at
> > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(
> AbstractChannelHandlerContext.java:362)
> >     at
> > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(
> AbstractChannelHandlerContext.java:348)
> >     at
> > io.netty.channel.DefaultChannelPipeline.fireChannelRead(
> DefaultChannelPipeline.java:926)
> >     at
> > io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.
> epollInReady(AbstractEpollStreamChannel.java:979)
> >     at
> > io.netty.channel.epoll.EpollEventLoop.processReady(
> EpollEventLoop.java:401)
> >     at io.netty.channel.epoll.EpollEventLoop.run(
> EpollEventLoop.java:306)
> >     at
> > io.netty.util.concurrent.SingleThreadEventExecutor$5.
> run(SingleThreadEventExecutor.java:858)
> >     at java.base/java.lang.Thread.run(Thread.java:844)
> > Caused by: com.google.protobuf.InvalidProtocolBufferException: Protocol
> > message contained an invalid tag (zero).
> >     at
> > com.google.protobuf.InvalidProtocolBufferException.invalidTag(
> InvalidProtocolBufferException.java:105)
> >     at
> > com.google.protobuf.CodedInputStream$StreamDecoder.readTag(
> CodedInputStream.java:2060)
> >     at
> > org.apache.bookkeeper.proto.BookkeeperProtocol$ReadResponse.<init>(
> BookkeeperProtocol.java:10009)
> >     at
> > org.apache.bookkeeper.proto.BookkeeperProtocol$ReadResponse.<init>(
> BookkeeperProtocol.java:9975)
> >     at
> > org.apache.bookkeeper.proto.BookkeeperProtocol$ReadResponse$1.
> parsePartialFrom(BookkeeperProtocol.java:10869)
> >     at
> > org.apache.bookkeeper.proto.BookkeeperProtocol$ReadResponse$1.
> parsePartialFrom(BookkeeperProtocol.java:10864)
> >     at
> > com.google.protobuf.CodedInputStream$StreamDecoder.readMessage(
> CodedInputStream.java:2362)
> >     at
> > org.apache.bookkeeper.proto.BookkeeperProtocol$Response.<
> init>(BookkeeperProtocol.java:7852)
> >     at
> > org.apache.bookkeeper.proto.BookkeeperProtocol$Response.<
> init>(BookkeeperProtocol.java:7782)
> >     at
> > org.apache.bookkeeper.proto.BookkeeperProtocol$Response$1.
> parsePartialFrom(BookkeeperProtocol.java:9887)
> >     at
> > org.apache.bookkeeper.proto.BookkeeperProtocol$Response$1.
> parsePartialFrom(BookkeeperProtocol.java:9882)
> >     at
> > com.google.protobuf.AbstractParser.parsePartialFrom(
> AbstractParser.java:221)
> >     at com.google.protobuf.AbstractParser.parseFrom(
> AbstractParser.java:239)
> >     at com.google.protobuf.AbstractParser.parseFrom(
> AbstractParser.java:49)
> >     at
> > com.google.protobuf.GeneratedMessageV3.parseWithIOException(
> GeneratedMessageV3.java:320)
> >     at
> > org.apache.bookkeeper.proto.BookkeeperProtocol$Response.
> parseFrom(BookkeeperProtocol.java:8454)
> >     at
> > org.apache.bookkeeper.proto.BookieProtoEncoding$
> ResponseEnDecoderV3.decode(BookieProtoEncoding.java:329)
> >     at
> > org.apache.bookkeeper.proto.BookieProtoEncoding$ResponseDecoder.decode(
> BookieProtoEncoding.java:470)
> >     at
> > io.netty.handler.codec.MessageToMessageDecoder.channelRead(
> MessageToMessageDecoder.java:88)
> >     ... 19 more
> >
> >
> > 2018-03-09 22:37 GMT+01:00 Sijie Guo <guosijie@xxxxxxxxx>:
> >
> >> Enrico,
> >>
> >> I would suggest you applied my fixes and then debug from there. In this
> >> way, you will have a better sense where the first corruption is from.
> >>
> >> Sijie
> >>
> >> On Fri, Mar 9, 2018 at 11:48 AM Enrico Olivelli <eolivelli@xxxxxxxxx>
> >> wrote:
> >>
> >> > Il ven 9 mar 2018, 19:30 Enrico Olivelli <eolivelli@xxxxxxxxx> ha
> >> scritto:
> >> >
> >> > > Thank you Ivan!
> >> > > I hope I did not mess up the dump and added ZK ports.  We are not
> using
> >> > > standard ports and in that 3  machines there is also the 3 nodes zk
> >> > > ensemble which is supporting BK and all the other parts of the
> >> > application
> >> > >
> >> > > So one explanation would be that something is connecting to the
> bookie
> >> > and
> >> > > this makes the bookie switch in a corrupted state by double
> releasing a
> >> > > bytebuf?
> >> > >
> >> >
> >> > I did some experiments and it is easy to reproduce the bookie side
> error
> >> > and the double release with a forged sequence of bytes (just using nc
> >> from
> >> > the shell)
> >> > But this seems not enough to break the bookie.
> >> > I guess there is some corruption on client side and the error on the
> >> bookie
> >> > is only and effect, as Ivan is saying.
> >> > My colleagues left the system running  with a deep level of debug
> during
> >> > next weekend, hopefully we will get some other stacktrace
> >> >
> >> > Enrico
> >> >
> >> >
> >> > > Enrico
> >> > >
> >> > >
> >> > > Il ven 9 mar 2018, 18:23 Ivan Kelly <ivank@xxxxxxxxxx> ha scritto:
> >> > >
> >> > >> I need to sign off for the day. I've done some analysis of a
> tcpdump
> >> > >> enrico sent to me out of band (may contain sensitive info so best
> not
> >> > >> to post on public forum).
> >> > >>
> >> > >> I've attached a dump of just first bit of the header. Format is
> >> > >> <sequence in dump> <whether a request or response>(<remote port>)
> >> > >> <hexdump of payload>
> >> > >>
> >> > >> There are definitely corrupt packets coming from somewhere. Search
> for
> >> > >> lines with CORRUPT.
> >> > >>
> >> > >> <snip>
> >> > >> 0247 -  req (049546) - 00:00:00:08:ff:ff:ff:fe:00:00:00:0b
> CORRUPT
> >> > >> </snip>
> >> > >>
> >> > >> It's not clear whether these are originating at a valid client or
> not.
> >> > >> These trigger corrupt responses from the server, which I guess is
> the
> >> > >> double free manifesting itself. Strangely the
> >> > >> corrupt message seems to have a lot of data in common with what
> seems
> >> > >> like an ok message (it's clearer on fixed width font).
> >> > >>
> >> > >> <snip>
> >> > >> 0248 -  resp(049720) -
> >> > >>
> >> > >>
> >> > 00:00:00:54:00:03:00:89:00:00:02:86:00:07:e2:b1:00:00:00:00:
> >> 00:00:02:86:00:05:e9:76:00:00
> >> > >> 0249 -  resp(049546) -
> >> > >> 00:00:00:10:ff:ff:ff:fe:00:00:02:86:00:07:e2:b1:00:00:00:00
> >> CORRUPT
> >> > >> </snip>
> >> > >>
> >> > >> There's also some other weird traffic. Correct BK protobuf traffic
> >> > >> should be <4 bytes len>:00:03:....
> >> > >> There seems to be other traffic which is being accepted at the same
> >> > >> port, but looks like ZK traffic.
> >> > >>
> >> > >> Anyhow, I'll dig more on monday.
> >> > >>
> >> > >> -Ivan
> >> > >>
> >> > >>
> >> > >> On Fri, Mar 9, 2018 at 3:27 PM, Ivan Kelly <ivank@xxxxxxxxxx>
> wrote:
> >> > >> > On Fri, Mar 9, 2018 at 3:20 PM, Enrico Olivelli <
> >> eolivelli@xxxxxxxxx>
> >> > >> wrote:
> >> > >> >> Bookies
> >> > >> >> 10.168.10.117:1822 -> bad bookie with 4.1.21
> >> > >> >> 10.168.10.116:1822 -> bookie with 4.1.12
> >> > >> >> 10.168.10.118:1281 -> bookie with 4.1.12
> >> > >> >>
> >> > >> >> 10.168.10.117 client machine on which I have 4.1.21 client
> >> (different
> >> > >> >> process than the bookie one)
> >> > >> > Oh. This dump won't have the stream we need then, as that will
> be on
> >> > >> > loopback. Try adding "-i any" to the tcpdump. Sorry, I didn't
> >> realize
> >> > >> > your clients and servers are colocated.
> >> > >> >
> >> > >> > -Ivan
> >> > >>
> >> > > --
> >> > >
> >> > >
> >> > > -- Enrico Olivelli
> >> > >
> >> > --
> >> >
> >> >
> >> > -- Enrico Olivelli
> >> >
> >>
>