git.net

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

JobManager did not respond within 60000 ms


I have a streaming job that works in standalone cluster. Flink version is 1.4.1. Everything was working so far. But since I added new treatments, I can not start my job anymore. I have this exception : 

org.apache.flink.client.program.ProgramInvocationException: The program execution failed: JobManager did not respond within 60000 ms
        at org.apache.flink.client.program.ClusterClient.runDetached(ClusterClient.java:524)
        at org.apache.flink.client.program.StandaloneClusterClient.submitJob(StandaloneClusterClient.java:103)
        at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:456)
        at org.apache.flink.client.program.DetachedEnvironment.finalizeExecute(DetachedEnvironment.java:77)
        at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:402)
        at org.apache.flink.client.CliFrontend.executeProgram(CliFrontend.java:802)
        at org.apache.flink.client.CliFrontend.run(CliFrontend.java:282)
        at org.apache.flink.client.CliFrontend.parseParameters(CliFrontend.java:1054)
        at org.apache.flink.client.CliFrontend$1.call(CliFrontend.java:1101)
        at org.apache.flink.client.CliFrontend$1.call(CliFrontend.java:1098)
        at org.apache.flink.runtime.security.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:30)
        at org.apache.flink.client.CliFrontend.main(CliFrontend.java:1098)
Caused by: org.apache.flink.runtime.client.JobTimeoutException: JobManager did not respond within 60000 ms
        at org.apache.flink.runtime.client.JobClient.submitJobDetached(JobClient.java:437)
        at org.apache.flink.client.program.ClusterClient.runDetached(ClusterClient.java:516)
        ... 11 more
Caused by: java.util.concurrent.TimeoutException
        at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1771)
        at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915)
        at org.apache.flink.runtime.client.JobClient.submitJobDetached(JobClient.java:435)
        ... 12 more

I see a very strange behavior. When I comment on a function (any one, for example a FilterFunction, which was present before or after my modification). 
I tried to change the configuration (akka.client.timeout and akka.framesize) without success.

This is my flink-conf.yaml
  jobmanager.rpc.address: myhost
  jobmanager.rpc.port: 6123
  jobmanager.heap.mb: 128
  taskmanager.heap.mb: 1024
  taskmanager.numberOfTaskSlots: 100
  taskmanager.memory.preallocate: false
  taskmanager.data.port: 6121
  parallelism.default: 1
  taskmanager.tmp.dirs: /dohdev/flink/tmp/tskmgr
  blob.storage.directory: /dohdev/flink/tmp/blob
  jobmanager.web.port: -1
  high-availability: zookeeper
  high-availability.zookeeper.quorum: localhost:2181
  high-availability.zookeeper.path.root: /dohdev/flink
  high-availability.cluster-id: dev
  high-availability.storageDir: file:////mnt/metaflink
  high-availability.zookeeper.storageDir: /mnt/metaflink/inh/agregateur/recovery
  restart-strategy: fixed-delay
  restart-strategy.fixed-delay.attempts: 1000
  restart-strategy.fixed-delay.delay: 5 s
  zookeeper.sasl.disable: true
  blob.service.cleanup.interval: 60

And I launch a job with this command : bin/flink run -d myjar.jar

I added as an attachment a graph of my job when it works (Graph.PNG).

Do you have an idea of the problem ?

Thanks.
Julien


Attachment: Graph.PNG
Description: PNG image