
Re: 1.5 some thing weird


Hi Vishal,

It looks as if flushing the checkpoint data to HDFS failed because the lease on the checkpoint file had expired. Flink therefore aborted checkpoint `chk-125` and removed it. This is the normal behaviour when Flink cannot complete a checkpoint. As you can see, the subsequent checkpoints succeed again.

Cheers,
Till

On Mon, Jul 9, 2018 at 7:15 PM Vishal Santoshi <vishal.santoshi@xxxxxxxxx> wrote:
drwxr-xr-x   - root hadoop          0 2018-07-09 12:33 /flink/kpi_unique/392d0436e53f3ef5e494ba3cc63428bf/chk-123
drwxr-xr-x   - root hadoop          0 2018-07-09 12:35 /flink/kpi_unique/392d0436e53f3ef5e494ba3cc63428bf/chk-124
drwxr-xr-x   - root hadoop          0 2018-07-09 12:51 /flink/kpi_unique/392d0436e53f3ef5e494ba3cc63428bf/chk-126
drwxr-xr-x   - root hadoop          0 2018-07-09 12:53 /flink/kpi_unique/392d0436e53f3ef5e494ba3cc63428bf/chk-127
drwxr-xr-x   - root hadoop          0 2018-07-09 12:55 /flink/kpi_unique/392d0436e53f3ef5e494ba3cc63428bf/chk-128

See the missing chk-125
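
A gap like this can also be spotted programmatically. The following is only a minimal sketch using the plain JDK: the class name `CheckpointGaps` and the method `findMissing` are hypothetical, and it assumes each listing line ends with a `chk-<n>` directory name, as in the listing above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Flink or Hadoop.
final class CheckpointGaps {
    // Matches a trailing "/chk-<n>" directory name at the end of a listing line.
    private static final Pattern CHK = Pattern.compile("/chk-(\\d+)\\s*$");

    /** Returns the checkpoint IDs absent between the lowest and highest ID seen. */
    static List<Integer> findMissing(List<String> listingLines) {
        TreeSet<Integer> seen = new TreeSet<>();
        for (String line : listingLines) {
            Matcher m = CHK.matcher(line);
            if (m.find()) {
                seen.add(Integer.parseInt(m.group(1)));
            }
        }
        List<Integer> missing = new ArrayList<>();
        if (seen.isEmpty()) {
            return missing;
        }
        // Walk the full ID range and collect every ID that has no directory.
        for (int id = seen.first(); id <= seen.last(); id++) {
            if (!seen.contains(id)) {
                missing.add(id);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        String base = "/flink/kpi_unique/392d0436e53f3ef5e494ba3cc63428bf/";
        List<String> listing = Arrays.asList(
            "drwxr-xr-x   - root hadoop          0 2018-07-09 12:33 " + base + "chk-123",
            "drwxr-xr-x   - root hadoop          0 2018-07-09 12:35 " + base + "chk-124",
            "drwxr-xr-x   - root hadoop          0 2018-07-09 12:51 " + base + "chk-126",
            "drwxr-xr-x   - root hadoop          0 2018-07-09 12:53 " + base + "chk-127",
            "drwxr-xr-x   - root hadoop          0 2018-07-09 12:55 " + base + "chk-128");
        System.out.println(findMissing(listing)); // prints [125]
    }
}
```

Note that Flink may retain only the most recent completed checkpoints, so the sketch only looks for holes between the lowest and highest IDs present.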

These are the checkpoints I see for the job. At 2018-07-09 12:38:43 the exception below was thrown.

chk-125 is missing from HDFS and the job complains about it:
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /flink/kpi_unique/392d0436e53f3ef5e494ba3cc63428bf/chk-125/e9d6886c-e693-4827-97bc-dd3fd526b64e (inode 1987098987): File does not exist. Holder DFSClient_NONMAPREDUCE_1527557459_11240 does not have any open files.

At about the same time:

ID: 125, Failure Time: 12:38:23, Cause: Checkpoint expired before completing.


Is this some race condition? A checkpoint had to be taken, and that was chk-125; it took longer than the configured time (1 minute). It aborted the pipeline. Should it have? It did not even create chk-125, yet it then refers to it and aborts the pipeline.






 

This is the full exception.
AsynchronousException{java.lang.Exception: Could not materialize checkpoint 125 for operator 360 minute interval -> 360 minutes to TimeSeries.Entry.2 (5/6).}
	at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointExceptionHandler.tryHandleCheckpointException(StreamTask.java:1154)
	at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.handleExecutionException(StreamTask.java:948)
	at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:885)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.Exception: Could not materialize checkpoint 125 for operator 360 minute interval -> 360 minutes to TimeSeries.Entry.2 (5/6).
	at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.handleExecutionException(StreamTask.java:943)
	... 6 more
Caused by: java.util.concurrent.ExecutionException: java.io.IOException: Could not flush and close the file system output stream to hdfs://nn-crunchy.bf2.tumblr.net:8020/flink/kpi_unique/392d0436e53f3ef5e494ba3cc63428bf/chk-125/e9d6886c-e693-4827-97bc-dd3fd526b64e in order to obtain the stream state handle
	at java.util.concurrent.FutureTask.report(FutureTask.java:122)
	at java.util.concurrent.FutureTask.get(FutureTask.java:192)
	at org.apache.flink.util.FutureUtil.runIfNotDoneAndGet(FutureUtil.java:53)
	at org.apache.flink.streaming.api.operators.OperatorSnapshotFinalizer.<init>(OperatorSnapshotFinalizer.java:47)
	at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:854)
	... 5 more
Caused by: java.io.IOException: Could not flush and close the file system output stream to hdfs://nn-crunchy.bf2.tumblr.net:8020/flink/kpi_unique/392d0436e53f3ef5e494ba3cc63428bf/chk-125/e9d6886c-e693-4827-97bc-dd3fd526b64e in order to obtain the stream state handle
	at org.apache.flink.runtime.state.filesystem.FsCheckpointStreamFactory$FsCheckpointStateOutputStream.closeAndGetHandle(FsCheckpointStreamFactory.java:325)
	at org.apache.flink.runtime.state.CheckpointStreamWithResultProvider$PrimaryStreamOnly.closeAndFinalizeCheckpointStreamResult(CheckpointStreamWithResultProvider.java:77)
	at org.apache.flink.runtime.state.heap.HeapKeyedStateBackend$HeapSnapshotStrategy$1.performOperation(HeapKeyedStateBackend.java:705)
	at org.apache.flink.runtime.state.heap.HeapKeyedStateBackend$HeapSnapshotStrategy$1.performOperation(HeapKeyedStateBackend.java:641)
	at org.apache.flink.runtime.io.async.AbstractAsyncCallableWithResources.call(AbstractAsyncCallableWithResources.java:75)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at org.apache.flink.util.FutureUtil.runIfNotDoneAndGet(FutureUtil.java:50)
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /flink/kpi_unique/392d0436e53f3ef5e494ba3cc63428bf/chk-125/e9d6886c-e693-4827-97bc-dd3fd526b64e (inode 1987098987): File does not exist. Holder DFSClient_NONMAPREDUCE_1527557459_11240 does not have any open files.
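
The trace above is a chain of `Caused by` links, and the innermost one (the `LeaseExpiredException`) is the one that explains the failure. As a rough illustration of reading such a chain in code, here is a minimal sketch using only the JDK. The class `RootCause` is hypothetical, and the exceptions built in `main` are stand-ins with abbreviated messages, not the real Flink or Hadoop types.

```java
import java.io.IOException;
import java.util.concurrent.ExecutionException;

// Hypothetical helper, not part of Flink or Hadoop.
final class RootCause {
    /** Follows getCause() links until the innermost throwable is reached. */
    static Throwable of(Throwable t) {
        Throwable current = t;
        // Stop at the end of the chain; also guard against a self-referencing cause.
        while (current.getCause() != null && current.getCause() != current) {
            current = current.getCause();
        }
        return current;
    }

    public static void main(String[] args) {
        // Stand-ins mirroring the nesting order of the trace above (assumed, simplified).
        Throwable lease = new IllegalStateException("stand-in for LeaseExpiredException");
        Exception flush = new IOException("Could not flush and close the file system output stream", lease);
        Exception wrapped = new ExecutionException(flush);
        Exception top = new Exception("Could not materialize checkpoint 125", wrapped);

        System.out.println(RootCause.of(top).getMessage());
        // prints: stand-in for LeaseExpiredException
    }
}
```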