Re: Help with bad errors on 4.6.1


Very latest news:
I have narrowed the problem down to ResponseEnDecoderV3#encode: when I use
UnpooledByteBufAllocator.DEFAULT instead of the allocator from the channel,
the errors disappear.

So the problem lies in the encoding of responses when running on Java 9
with pooled ByteBufs.
This is consistent with the client-side errors about corrupted responses
when the client is on Java 8 and the server is on Java 9.
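
The observation that the errors depend on the allocator can be illustrated with a toy model. This is not Netty or BookKeeper code: `PoolingToy`, `ToyPool` and `encode` are hypothetical names, and plain `java.nio.ByteBuffer` stands in for `ByteBuf`. A pooled allocator hands back recycled memory that still holds old bytes, while a fresh Java allocation is zero-filled. As a result, an encoder that advances its write position over bytes it never wrote emits stale data only in the pooled case. The sketch does not claim to be the cause of the corruption in this thread; it shows one general way a bug can appear with pooled buffers and vanish with unpooled ones.

```java
import java.nio.ByteBuffer;
import java.util.ArrayDeque;
import java.util.Arrays;

// Toy model only: java.nio buffers standing in for Netty ByteBufs (hypothetical names).
public final class PoolingToy {

    // Hypothetical minimal pool: released buffers are reused as-is, without zeroing.
    static final class ToyPool {
        private final ArrayDeque<ByteBuffer> free = new ArrayDeque<>();

        ByteBuffer acquire(int size) {
            ByteBuffer b = free.poll();
            if (b == null || b.capacity() != size) {
                return ByteBuffer.allocate(size); // fresh buffers are zero-filled by the JVM
            }
            b.clear(); // resets position/limit only; the old bytes remain in memory
            return b;
        }

        void release(ByteBuffer b) {
            free.push(b);
        }
    }

    // Hypothetical buggy encoder: claims the whole buffer but writes only `written` bytes.
    static byte[] encode(ByteBuffer buf, byte[] payload, int written) {
        buf.put(payload, 0, written);
        buf.position(buf.capacity()); // advances over bytes it never wrote
        buf.flip();
        byte[] out = new byte[buf.remaining()];
        buf.get(out);
        return out;
    }

    public static void main(String[] args) {
        byte[] payload = {1, 2, 3, 4};
        ToyPool pool = new ToyPool();

        // Dirty a pooled buffer with an earlier message, then return it to the pool.
        ByteBuffer first = pool.acquire(4);
        first.put(new byte[] {9, 9, 9, 9});
        pool.release(first);

        byte[] pooled = encode(pool.acquire(4), payload, 2);
        byte[] unpooled = encode(ByteBuffer.allocate(4), payload, 2);

        System.out.println("pooled   = " + Arrays.toString(pooled));   // [1, 2, 9, 9]
        System.out.println("unpooled = " + Arrays.toString(unpooled)); // [1, 2, 0, 0]
    }
}
```

The same bug produces different bytes depending only on where the buffer came from, which is why swapping the allocator can make a symptom disappear without fixing the underlying cause.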

I am now testing with the bookie on Java 8 and clients on Java 9, and the
problem looks the same: I receive corrupted messages on the bookie.

Does this ring any bells?

What is different about Channel#write/ByteBuf pooling in Java 9?

Enrico

2018-03-15 5:21 GMT+01:00 Enrico Olivelli <eolivelli@xxxxxxxxx>:

> Latest findings: some good news, and some very bad.
>
> Good news:
> I was wrong: I had not switched the system back to Java 8 correctly.
>
> The problem is on the bookie side and occurs only if the bookie is on Java 9.
>
> Bad news:
> I have a fix: use unpooled ByteBufs in serializeProtobuf:
>
> private static ByteBuf serializeProtobuf(MessageLite msg, ByteBufAllocator allocator) {
>     int size = msg.getSerializedSize();
>     // hotfix: ignore the channel allocator and allocate an unpooled buffer
>     ByteBuf buf = Unpooled.buffer(size, size);
>     ...
>
> I will keep tracking down the cause; I think it is on the read path
> (not sure).
>
> On the client side we have a flag to disable pooled ByteBufs in the channel
> allocator; the most trivial fix for now is to do the same on the bookie
> side, as a hotfix for branch 4.6.
>
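A minimal sketch of the kind of hotfix described above, assuming a plain Netty 4.1 `ServerBootstrap`; it is not BookKeeper's actual server setup, nor its client-side flag. The `ChannelOption.ALLOCATOR` child option sets the allocator of every accepted channel, so handlers that call `ctx.alloc()` on those channels receive it. `ServerAllocatorConfig`, `configureAllocator` and the `usePooled` parameter are hypothetical names.

```java
import io.netty.bootstrap.ServerBootstrap;
import io.netty.buffer.ByteBufAllocator;
import io.netty.buffer.PooledByteBufAllocator;
import io.netty.buffer.UnpooledByteBufAllocator;
import io.netty.channel.ChannelOption;

// Hypothetical helper, not BookKeeper code: chooses the allocator for accepted channels.
public final class ServerAllocatorConfig {

    static ServerBootstrap configureAllocator(ServerBootstrap bootstrap, boolean usePooled) {
        ByteBufAllocator allocator = usePooled
                ? PooledByteBufAllocator.DEFAULT
                : UnpooledByteBufAllocator.DEFAULT;
        // childOption applies to each channel the server accepts; ctx.alloc() in
        // handlers on those channels then returns this allocator.
        return bootstrap.childOption(ChannelOption.ALLOCATOR, allocator);
    }
}
```

Passing `usePooled = false` corresponds to the unpooled hotfix, at the cost of more allocation and GC work than pooling.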
> Before jumping to this extreme hotfix, I will dig into the issue; now that
> I know the problem occurs ONLY on Java 9 and on the bookie, it will be
> simpler to find a reproducer.
>
> The fact remains that on my other systems and in the test cases there are
> no failures.
>
> Honestly, I have no Java 9 bookies in production, only Java 8 ones; maybe
> this is why no one has ever reported this problem from production.
>
> Enrico
>
>
>
>
> 2018-03-14 17:27 GMT+01:00 Ivan Kelly <ivank@xxxxxxxxxx>:
>
>> >> > @Ivan
>> >> > I wonder if some tests on Jepsen with bookie restarts may find this
>> >> > kind of issue, given that it is not a network/OS problem
>> >> If Jepsen can catch it, then a normal integration test can.
>>
>> I attempted a repro for this using the integration test stuff.
>> Running it in a loop for 2-3 hours, I didn't hit the bug. Perhaps I'm
>> not doing exactly what you are doing.
>>
>> https://github.com/ivankelly/bookkeeper/blob/enrico-bug/tests/integration/enrico-bug/src/test/java/org/apache/bookkeeper/tests/integration/TestEnricoBug.java
>>
>> -Ivan
>>
>
>