Hi Hequn & Kien,
The problem is finally solved.
It was caused by slow sink writes. Because the job has only 2 tasks, I checked the backpressure and found that the source had high backpressure, which pointed to the sink. After I improved the sink write performance, the end-to-end duration dropped below 1s and the checkpoint timeout was fixed.
Hequn & Kien,
Thanks a lot for your help, I will try it later.
@Kien is right. Take a thread dump to see what the TaskManager was doing. Also check whether GC happens frequently.
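A minimal sketch of both checks using only the JDK's standard `java.lang.management` API. The class name `TaskManagerDiagnostics` is made up for illustration, and the code only inspects the JVM it runs in, so it would have to run inside the TaskManager process. From outside, running `jstack <pid>` against the TaskManager process gives a similar thread dump.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical helper, not part of Flink.
class TaskManagerDiagnostics {

    // Returns the name, state and full stack trace of every live thread in this JVM.
    static String threadDump() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        StringBuilder sb = new StringBuilder();
        // false, false: skip lock/monitor details to keep the dump cheap.
        for (ThreadInfo info : threads.dumpAllThreads(false, false)) {
            sb.append('"').append(info.getThreadName()).append("\" ")
              .append(info.getThreadState()).append('\n');
            for (StackTraceElement frame : info.getStackTrace()) {
                sb.append("    at ").append(frame).append('\n');
            }
        }
        return sb.toString();
    }

    // Returns cumulative collection count and time per garbage collector.
    // Calling this twice and comparing the values shows how often GC runs.
    static String gcSummary() {
        StringBuilder sb = new StringBuilder();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            sb.append(gc.getName())
              .append(": count=").append(gc.getCollectionCount())
              .append(", timeMs=").append(gc.getCollectionTime())
              .append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(threadDump());
        System.out.println(gcSummary());
    }
}
```

In the dump, look at what the sink and source threads are blocked on. A sink thread stuck in a write call while the source thread waits for buffers would be consistent with backpressure from a slow sink.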
I am running a Flink application with parallelism 64. I left the checkpoint timeout at its default value of 10 minutes. The state size is less than 1 MB, and I am using the FsStateBackend.
The application triggers some checkpoints, but all of them fail with "Checkpoint expired before completing". In the checkpoint history I found that 63 subtasks acknowledged but one is still shown as n/a, and the alignment duration is quite long, about 5m27s.
I want to know why one subtask does not acknowledge. And since the alignment duration is long, what influences the alignment duration?
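For context, here is a rough model of alignment duration, assuming it is the time between the first and the last checkpoint barrier arriving at a subtask across its input channels. The class `AlignmentSketch` and the arrival times are invented for illustration; they are not measurements from this job.

```java
import java.util.Arrays;

// Hypothetical model, not Flink code.
class AlignmentSketch {

    // Alignment duration = arrival of the last barrier minus arrival of the first,
    // so a single slow input channel determines the result.
    static long alignmentMillis(long[] barrierArrivalMillis) {
        long first = Arrays.stream(barrierArrivalMillis).min().orElse(0L);
        long last = Arrays.stream(barrierArrivalMillis).max().orElse(0L);
        return last - first;
    }

    public static void main(String[] args) {
        // Illustrative values: three channels deliver the barrier quickly,
        // one backpressured channel delivers it minutes later.
        long[] arrivals = {1_000L, 1_200L, 1_100L, 328_000L};
        System.out.println(alignmentMillis(arrivals) + " ms"); // prints "327000 ms"
    }
}
```

Under this model, anything that delays a barrier on one channel (backpressure, a slow operator or sink, long GC pauses) stretches the alignment, and the subtask cannot acknowledge until the last barrier arrives.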
Thanks a lot.