I started a Flink program that runs on YARN. At first it could not acquire enough resources, so this exception was thrown:
“org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Could not allocate all requires slots within timeout of 300000 ms. Slots required: 16, slots allocated: 7”.
Then the JobManager restarted automatically, but since then it has failed to trigger any checkpoint because each one “expired before completing”. All the TaskManagers are blocked, and there seems to be a deadlock in SafetyNetCloseableRegistry, which may be why the whole TaskManager is blocked. The TaskManager’s stack is attached.
Best, Jiayi Liao