
Re: RocksDB State Backend Exception

Hi Ning,

The first problem here is that the RocksDB Java JNI client has diverged from the RocksDB C++ code in status.h,
as mentioned in the Flink issue you refer to.

Flink 1.6 uses the RocksDB 5.7.5 Java client.
The JNI code there is missing these status subcodes:
kNoSpace = 4,
kDeadlock = 5,
kStaleFile = 6,
kMemoryLimit = 7
any of which could be the underlying problem in the job.

kNoSpace is only one of them.
Another probable cause is kStaleFile, i.e. some file system IO problem.
kDeadlock seems to be used only with transactions, so it is not relevant here.
kMemoryLimit means that a write batch exceeded its maximum size, but as I understand it we do not set a limit for that.
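For reference, the failure mode can be sketched like this: the 5.7.5 Java client maps the native subcode byte onto an enum, and any value it does not know (such as kNoSpace = 4 coming from a newer C++ side) produces exactly the IllegalArgumentException in your stack trace. The enum below is a simplified sketch of that mapping, not the exact RocksDB source:

```java
// Simplified sketch of how the RocksDB 5.7.5 Java client decodes the
// native status subcode. Subcodes 4-7 (kNoSpace, kDeadlock, kStaleFile,
// kMemoryLimit) exist in the C++ status.h but not in this enum, so
// decoding any of them throws IllegalArgumentException.
public class SubCodeSketch {
    enum SubCode {
        None((byte) 0),
        MutexTimeout((byte) 1),
        LockTimeout((byte) 2),
        LockLimit((byte) 3),
        MaxSubCode((byte) 0x7F);

        private final byte value;

        SubCode(byte value) {
            this.value = value;
        }

        static SubCode getSubCode(byte value) {
            for (SubCode subCode : SubCode.values()) {
                if (subCode.value == value) {
                    return subCode;
                }
            }
            // No numeric value in the message, which is why the
            // actual unknown subcode is so hard to debug.
            throw new IllegalArgumentException(
                "Illegal value provided for SubCode.");
        }
    }

    public static void main(String[] args) {
        // A subcode known to the 5.7.5 client decodes fine.
        System.out.println(SubCode.getSubCode((byte) 2));
        // kNoSpace (4) from a newer native library is unknown here.
        try {
            SubCode.getSubCode((byte) 4);
        } catch (IllegalArgumentException e) {
            System.out.println("caught: " + e.getMessage());
        }
    }
}
```

Note that the exception message carries no numeric value, so from the Java side alone you cannot tell which of the missing subcodes was actually hit.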

It would be easier to debug if the RocksDB JNI client at least logged the unknown subcode, but I do not see any easy way to log it in the current version without rebuilding RocksDB and subsequently Flink.

In the master branch, the Java Status and status.h are also out of sync. You could report this issue in the RocksDB repo, along with a request to extend the exception message with the numeric value of the unknown subcode. The Flink community plans to upgrade to the latest RocksDB version in one of the next Flink releases.


> On 25 Oct 2018, at 04:31, Ning Shi <ningshi2@xxxxxxxxx> wrote:
> Hi,
> We are doing some performance testing on a 12 node cluster with 8 task
> slots per TM. Every 15 minutes or so, the job would run into the
> following exception.
> java.lang.IllegalArgumentException: Illegal value provided for SubCode.
> 	at org.rocksdb.Status$SubCode.getSubCode(
> 	at org.rocksdb.Status.<init>(
> 	at org.rocksdb.RocksDB.put(Native Method)
> 	at org.rocksdb.RocksDB.put(
> 	at org.apache.flink.contrib.streaming.state.AbstractRocksDBAppendingState.updateInternal(
> 	at org.apache.flink.contrib.streaming.state.RocksDBReducingState.add(
> 	at org.apache.flink.streaming.runtime.operators.windowing.WindowOperator.processElement(
> 	at
> 	at
> 	at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(
> 	at
> 	at
> I saw an outstanding issue with similar exception in [1]. The ticket
> description suggests that it was due to out of disk error, but in our
> case, we have plenty of disk left on all TMs.
> Has anyone run into this before? If so, is there a fix or workaround?
> Thanks,
> [1]
> --
> Ning