
Re: Deadlock in SafetyNetCloseableRegistry?

Hi, all

Sorry for attaching this again. The Flink version is 1.6 and the deadlock stack is:

"CloseableReaperThread" #54 daemon prio=5 os_prio=0 tid=0x00007f4d6d3af000 nid=0x32f6 in Object.wait() [0x00007f4d3fdfe000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x00000000aefacb70> (a java.lang.ref.ReferenceQueue$Lock)
        at java.lang.ref.ReferenceQueue.remove(
        - locked <0x00000000aefacb70> (a java.lang.ref.ReferenceQueue$Lock)
        at java.lang.ref.ReferenceQueue.remove(
        at org.apache.flink.core.fs.SafetyNetCloseableRegistry$

       This thread is created in the AsyncCheckpointRunnable class and gets stuck, so the next checkpoint can’t acquire the lock in the performCheckpoint method and times out. How can I avoid this?
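       For comparison, here is a minimal sketch of how a thread parked in ReferenceQueue.remove() shows up in a dump. It is only an illustration: ReaperSketch, startReaper, observeState and the thread name "SketchReaperThread" are hypothetical names of mine and not Flink's SafetyNetCloseableRegistry code. It assumes nothing is ever enqueued on the queue, and it shows only what this thread state looks like; it does not settle whether the dump above is a real deadlock.

```java
import java.lang.ref.ReferenceQueue;

// Hypothetical stand-in for a reaper thread; NOT Flink's implementation.
class ReaperSketch {

    // Starts a daemon thread that blocks on the queue until a reference is enqueued.
    static Thread startReaper(ReferenceQueue<Object> queue) {
        Thread reaper = new Thread(() -> {
            try {
                while (true) {
                    // Blocks indefinitely while the queue is empty.
                    queue.remove();
                }
            } catch (InterruptedException e) {
                // Restore the interrupt flag and let the thread exit.
                Thread.currentThread().interrupt();
            }
        }, "SketchReaperThread");
        reaper.setDaemon(true);
        reaper.start();
        return reaper;
    }

    // Returns the state of the reaper thread once it has parked on the empty queue.
    static Thread.State observeState() {
        ReferenceQueue<Object> queue = new ReferenceQueue<>();
        Thread reaper = startReaper(queue);
        try {
            // Poll for up to ~5 seconds until the thread has blocked.
            for (int i = 0; i < 100 && reaper.getState() != Thread.State.WAITING; i++) {
                Thread.sleep(50);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        Thread.State state = reaper.getState();
        reaper.interrupt(); // stop the sketch thread
        return state;
    }

    public static void main(String[] args) {
        System.out.println("Reaper thread state: " + observeState());
    }
}
```

       Running jstack against such a process while the thread is parked should list it with a ReferenceQueue.remove frame, similar to the CloseableReaperThread entry above.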

       Best, Jiayi Liao

 Original Message 
Sender: bupt_ljy<bupt_ljy@xxxxxxx>
Recipient: user<user@xxxxxxxxxxxxxxxx>
Date: Tuesday, Sep 11, 2018 22:22
Subject: Deadlock in SafetyNetCloseableRegistry?


   I started a Flink program running on YARN. At first it could not acquire enough resources, so this exception was thrown:

“org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Could not allocate all requires slots within timeout of 300000 ms. Slots required: 16, slots allocated: 7”.

  Then the JobManager restarts automatically but can no longer trigger checkpoints, because each one “expired before completing”. All the TaskManagers are blocked, and there seems to be a deadlock in SafetyNetCloseableRegistry, which may be why the whole TaskManager is blocked. Here is the TaskManager’s stack:


  Best, Jiayi Liao