

Re: Flink 1.6 Job fails with IllegalStateException: Buffer pool is destroyed.


Thank you for your advice. I had not noticed that the log level was set to WARN.
The INFO logs suggest that the job fails because of an Akka timeout, and that the root cause is a long GC pause.
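If long GC pauses are tripping Akka timeouts, two common mitigations are raising the relevant timeouts and enabling GC logging to confirm the pauses. Below is a minimal flink-conf.yaml sketch, assuming the standard Flink 1.6 configuration keys (`akka.ask.timeout`, `heartbeat.timeout`, `env.java.opts`); the values are illustrative, not tuned recommendations:

```yaml
# Give RPC calls more headroom so a GC pause does not look like a dead peer
akka.ask.timeout: 60 s
# Heartbeat timeout between JobManager and TaskManagers (milliseconds)
heartbeat.timeout: 120000
# Print GC pause details in the TaskManager logs to confirm the diagnosis
env.java.opts: "-XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:gc.log"
```

Raising timeouts only hides the symptom; the GC log is what shows whether heap sizing or the job itself needs attention.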

On Fri, Sep 7, 2018 at 5:43 PM Zhijiang(wangzhijiang999) <wangzhijiang999@xxxxxxxxxx> wrote:
You may need to configure at least INFO level for the logger in Flink; the current messages are too limited to debug the problem.
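To get INFO-level messages, the log4j.properties file that Flink ships in its conf/ directory can be adjusted before submitting the job. A minimal sketch, assuming the default log4j 1.x setup used by Flink 1.6 (the appender name `file` is the shipped default):

```properties
# conf/log4j.properties -- raise verbosity from WARN to INFO
log4j.rootLogger=INFO, file
# Optionally keep chatty dependencies quieter than the root level
log4j.logger.akka=INFO
log4j.logger.org.apache.hadoop=WARN
```

With YARN per-job clusters this file is picked up when the client starts the cluster, so it must be edited on the submitting machine before launch.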

Best,
Zhijiang
------------------------------------------------------------------
From: 杨力 <bill.lee.y@xxxxxxxxx>
Sent: Friday, September 7, 2018, 17:21
To: Zhijiang(wangzhijiang999) <wangzhijiang999@xxxxxxxxxx>
Subject: Re: Flink 1.6 Job fails with IllegalStateException: Buffer pool is destroyed.

I have checked the logs from the YARN NodeManagers, and there is no record of any kill action. There is no record of job cancellation in the JobManager's log either.

Here are job logs retrieved from yarn.

https://pastebin.com/raw/1yHLYR65

Zhijiang(wangzhijiang999) <wangzhijiang999@xxxxxxxxxx> wrote on Friday, September 7, 2018, at 3:22 PM:
Hi,

I think the problem in the attached image is not the root cause of your job failure. There must be some other task or TaskManager failure; all related tasks are then cancelled by the JobManager, and the problem in the attached image is just a consequence of that task cancellation.

You can review the JobManager log to check whether any failures caused the whole job to fail.
FYI, the TaskManager may also be killed by YARN for exceeding its memory limit. You mentioned the job fails about half an hour after it starts, so I guess it is possible that the TaskManager was killed by YARN.

Best,
Zhijiang
------------------------------------------------------------------
From: 杨力 <bill.lee.y@xxxxxxxxx>
Sent: Friday, September 7, 2018, 13:09
To: user <user@xxxxxxxxxxxxxxxx>
Subject: Flink 1.6 Job fails with IllegalStateException: Buffer pool is destroyed.

Hi all,
I am encountering a weird problem when running Flink 1.6 in YARN per-job clusters.
The job fails about half an hour after it starts. The related logs are attached as an image.

This piece of log comes from one of the TaskManagers. There are no other related log lines
and no ERROR-level logs. The job just runs for tens of minutes without printing any logs
and then suddenly throws this exception.

It is reproducible in my production environment, but not in my test environment.
The 'Buffer pool is destroyed' exception is always thrown while emitting a latency marker.
