Flink weird checkpointing behaviour


We have just upgraded from Flink 1.3.2 to Flink 1.5.2 on EMR. Since the upgrade, some checkpoints take a very long time to complete, and some of them even fail with this exception:
Caused by: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://flink/user/jobmanager_0#-665361795]] after [60000 ms].

We have noticed that Checkpoint Duration (Async) accounts for most of the checkpoint time, compared to Checkpoint Duration (Sync). I thought that asynchronous checkpoints were only offered by the RocksDB state backend, but we use the filesystem state backend.

We didn't have such problems on Flink 1.3.2.


Flink configuration
akka.ask.timeout 60 s
classloader.resolve-order parent-first
containerized.heap-cutoff-ratio 0.15
env.hadoop.conf.dir /etc/hadoop/conf
env.yarn.conf.dir /etc/hadoop/conf
high-availability zookeeper
high-availability.cluster-id application_1540292869184_0001
high-availability.zookeeper.path.root /flink
high-availability.zookeeper.storageDir hdfs:///flink/recovery
internal.cluster.execution-mode NORMAL true
io.tmp.dirs /mnt/yarn/usercache/hadoop/appcache/application_1540292869184_0001
jobmanager.heap.mb 3072
jobmanager.rpc.port 41219
jobmanager.web.checkpoints.history 1000
parallelism.default 32
rest.port 0
state.backend filesystem
state.backend.fs.checkpointdir s3a://....
state.checkpoints.dir s3a://...
state.savepoints.dir s3a://...
taskmanager.heap.mb 6600
taskmanager.numberOfTaskSlots 1
web.port 0
web.tmpdir /tmp/flink-web-c3d16e22-1a33-46a2-9825-a6e268892199
yarn.application-attempts 10
yarn.maximum-failed-containers -1
zookeeper.sasl.disable true
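
For reference, the 60000 ms in the exception above matches the akka.ask.timeout value of 60 s in this configuration. Below is a minimal sketch of that check, assuming the dump is plain "key value" text as pasted here. The helper names (parse_config, duration_to_ms) and the unit table are hypothetical and are not part of Flink; only a few of the settings are copied in.

```python
# Hypothetical helpers for reading a "key value" configuration dump.
# The sample lines are copied from the configuration listed above.
CONFIG_DUMP = """\
akka.ask.timeout 60 s
state.backend filesystem
taskmanager.heap.mb 6600
taskmanager.numberOfTaskSlots 1
"""

# Assumed unit table; it only covers the units needed for this sketch.
UNIT_TO_MS = {"ms": 1, "s": 1000, "min": 60_000}


def parse_config(dump):
    """Split each non-empty line at the first space into key and value."""
    config = {}
    for line in dump.splitlines():
        line = line.strip()
        if not line:
            continue
        key, _, value = line.partition(" ")
        config[key] = value.strip()
    return config


def duration_to_ms(text):
    """Convert a string such as '60 s' into milliseconds."""
    amount, unit = text.split()
    return int(amount) * UNIT_TO_MS[unit]


config = parse_config(CONFIG_DUMP)
ask_timeout_ms = duration_to_ms(config["akka.ask.timeout"])
print(ask_timeout_ms)  # 60000, the same value reported in the AskTimeoutException
```

This only confirms that the failing ask call hit the configured timeout; it does not explain why the checkpoints themselves are slow.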