git.net

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

回复:checkpoint/taskmanager is stuck, deadlock on LocalBufferPool


From the stack below, it indicates there are no available buffers for source outputs including watermark and normal records, so the source will be blocked on request buffer from LocalBufferPool.
The checkpoint process is also affected by above blocking request. The root cause is why the queued output buffers are not consumed by downstream tasks.
I think you can check the downstream task which inqueue usage should reach 100%, then jstack the corresponding downstream tasks that may stuck in some operations to cause back pressure.

Best,
Zhijiang
------------------------------------------------------------------
发件人:Yan Zhou [FDS Science] <yzhou@xxxxxxxxxxx>
发送时间:2018年10月23日(星期二) 02:29
收件人:user@xxxxxxxxxxxxxxxx <user@xxxxxxxxxxxxxxxx>
主 题:Re: checkpoint/taskmanager is stuck, deadlock on LocalBufferPool

I am using flink 1.5.3 

From: Yan Zhou [FDS Science] <yzhou@xxxxxxxxxxx>
Sent: Monday, October 22, 2018 11:26
To: user@xxxxxxxxxxxxxxxx
Subject: checkpoint/taskmanager is stuck, deadlock on LocalBufferPool
 
Hi,
My application suddenly stuck and completely doesn't move forward after running for a few days. No exceptions are found. From the thread dump, I can see that the operator threads and checkpoint threads deadlock on LocalBufferPool. LocalBufferPool is not able to request memory and keep the lock. Please see the thread dump at the bottom. 

It uses rocksdb as statebackend. From the heap dump and web ui, there are plenty of memory in jvm and it doesn't have GC problem. Check points were good until there was the problem:

2018-10-19 04:41:23,691 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Triggering checkpoint 4347 @ 1539891683667 for job 1.
2018-10-19 04:41:45,069 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Completed checkpoint 4347 for job 1 (1019729450 bytes in 13600 ms).
2018-10-19 04:46:45,089 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Triggering checkpoint 4348 @ 1539892005069 for job 1.
2018-10-19 04:56:45,089 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Checkpoint 4348 of job 1 expired before completing.


This happened at mid night and the traffic was relatively low. Even if there was a spike and caused a back pressure, to my understand that the events should be processed  eventually and the network buffer would be available after that. What might be the cause of it? 


Best
Yan

"Time Trigger for Source: Custom Source -> Flat Map -> Flat Map -> Timestamps/Watermarks -> (from: ...s#363 daemon prio=5 os_prio=0 tid=0x00007ff187944000 nid=0x8f76 in Object.wait() [0x00007ff12fda9000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
at org.apache.flink.runtime.io.network.buffer.LocalBufferPool.requestMemorySegment(LocalBufferPool.java:247)
- locked <0x00000006dadeeac8> (a java.util.ArrayDeque)
at org.apache.flink.runtime.io.network.buffer.LocalBufferPool.requestBufferBuilderBlocking(LocalBufferPool.java:204)
at org.apache.flink.runtime.io.network.api.writer.RecordWriter.requestNewBufferBuilder(RecordWriter.java:213)
at org.apache.flink.runtime.io.network.api.writer.RecordWriter.sendToTarget(RecordWriter.java:144)
at org.apache.flink.runtime.io.network.api.writer.RecordWriter.broadcastEmit(RecordWriter.java:117)
at org.apache.flink.streaming.runtime.io.StreamRecordWriter.broadcastEmit(StreamRecordWriter.java:87)
at org.apache.flink.streaming.runtime.io.RecordWriterOutput.emitWatermark(RecordWriterOutput.java:121)
at org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.emitWatermark(AbstractStreamOperator.java:668)
at org.apache.flink.streaming.api.operators.AbstractStreamOperator.processWatermark(AbstractStreamOperator.java:736)
at org.apache.flink.streaming.api.operators.ProcessOperator.processWatermark(ProcessOperator.java:72)
at org.apache.flink.streaming.runtime.tasks.OperatorChain$ChainingOutput.emitWatermark(OperatorChain.java:479)
at org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.emitWatermark(AbstractStreamOperator.java:668)
at org.apache.flink.streaming.api.operators.AbstractStreamOperator.processWatermark(AbstractStreamOperator.java:736)
at org.apache.flink.streaming.api.operators.ProcessOperator.processWatermark(ProcessOperator.java:72)
at org.apache.flink.streaming.runtime.tasks.OperatorChain$ChainingOutput.emitWatermark(OperatorChain.java:479)
at org.apache.flink.streaming.runtime.tasks.OperatorChain$BroadcastingOutputCollector.emitWatermark(OperatorChain.java:603)
at org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.emitWatermark(AbstractStreamOperator.java:668)
at org.apache.flink.streaming.runtime.operators.TimestampsAndPeriodicWatermarksOperator.onProcessingTime(TimestampsAndPeriodicWatermarksOperator.java:77)
at org.apache.flink.streaming.runtime.tasks.SystemProcessingTimeService$TriggerTask.run(SystemProcessingTimeService.java:285)
- locked <0x00000006dcb3db50> (a java.lang.Object)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)

   Locked ownable synchronizers:
- <0x00000006eac38310> (a java.util.concurrent.ThreadPoolExecutor$Worker)


"Source: Custom Source -> Flat Map -> Flat Map -> Timestamps/Watermarks -> (from: ...s#116 prio=5 os_prio=0 tid=0x00007ff1a791a000 nid=0x78c9 waiting for monitor entry [0x00007ff14f6da000]
   java.lang.Thread.State: BLOCKED (on object monitor)
at org.apache.flink.streaming.connectors.kafka.internals.AbstractFetcher.emitRecordWithTimestamp(AbstractFetcher.java:397)
- waiting to lock <0x00000006dcb3db50> (a java.lang.Object)
at org.apache.flink.streaming.connectors.kafka.internal.Kafka010Fetcher.emitRecord(Kafka010Fetcher.java:89)
at org.apache.flink.streaming.connectors.kafka.internal.Kafka09Fetcher.runFetchLoop(Kafka09Fetcher.java:154)
at org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumerBase.run(FlinkKafkaConsumerBase.java:738)
at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:87)
at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:56)
at org.apache.flink.streaming.runtime.tasks.SourceStreamTask.run(SourceStreamTask.java:99)
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:306)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:712)
at java.lang.Thread.run(Thread.java:748)

   Locked ownable synchronizers:
- None


"Async calls on Source: Custom Source -> Flat Map -> Flat Map -> Timestamps/Watermarks -> (from: (... sel#409 daemon prio=5 os_prio=0 tid=0x00007ff1682fc800 nid=0x943c waiting for monitor entry [0x00007ff0dae3a000]
   java.lang.Thread.State: BLOCKED (on object monitor)
at org.apache.flink.streaming.runtime.tasks.StreamTask.performCheckpoint(StreamTask.java:620)
- waiting to lock <0x00000006dcb3db50> (a java.lang.Object)
at org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpoint(StreamTask.java:564)
at org.apache.flink.streaming.runtime.tasks.SourceStreamTask.triggerCheckpoint(SourceStreamTask.java:116)
at org.apache.flink.runtime.taskmanager.Task$1.run(Task.java:1229)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)

   Locked ownable synchronizers:
- <0x00000006ee7095f0> (a java.util.concurrent.ThreadPoolExecutor$Worker)