git.net

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Batch job stuck in Canceled state in Flink 1.5


Hi All,

I totally missed this thread. I've encountered same issue in Flink
1.5.0 RC4. Please look over the attached logs of JM and impacted TM.

Job ID 390a96eaae733f8e2f12fc6c49b26b8b

--
Thanks,
Amit

On Thu, May 3, 2018 at 8:31 PM, Nico Kruber <nico@xxxxxxxxxxxxxxxxx> wrote:
> Also, please have a look at the other TaskManagers' logs, in particular
> the one that is running the operator that was mentioned in the
> exception. You should look out for the ID 98f5976716234236dc69fb0e82a0cc34.
>
>
> Nico
>
>
> PS: Flink logs files should compress quite nicely if they grow too big :)
>
> On 03/05/18 14:07, Stephan Ewen wrote:
>> Google Drive would be great.
>>
>> Thanks!
>>
>> On Thu, May 3, 2018 at 1:33 PM, Amit Jain <aj2011it@xxxxxxxxx
>> <mailto:aj2011it@xxxxxxxxx>> wrote:
>>
>>     Hi Stephan,
>>
>>     Size of JM log file is 122 MB. Could you provide me other media to
>>     post the same? We can use Google Drive if that's fine with you.
>>
>>     --
>>     Thanks,
>>     Amit
>>
>>     On Thu, May 3, 2018 at 12:58 PM, Stephan Ewen <sewen@xxxxxxxxxx
>>     <mailto:sewen@xxxxxxxxxx>> wrote:
>>     > Hi Amit!
>>     >
>>     > Thanks for sharing this, this looks like a regression with the
>>     network stack
>>     > changes.
>>     >
>>     > The log you shared from the TaskManager gives some hint, but that
>>     exception
>>     > alone should not be a problem. That exception can occur under a
>>     race between
>>     > deployment of some tasks while the whole job is entering a
>>     recovery phase
>>     > (maybe we should not print it so prominently to not confuse
>>     users). There
>>     > must be something else happening on the JobManager. Can you share
>>     the JM
>>     > logs as well?
>>     >
>>     > Thanks a lot,
>>     > Stephan
>>     >
>>     >
>>     > On Wed, May 2, 2018 at 12:21 PM, Amit Jain <aj2011it@xxxxxxxxx
>>     <mailto:aj2011it@xxxxxxxxx>> wrote:
>>     >>
>>     >> Thanks! Fabian
>>     >>
>>     >> I will try using the current release-1.5 branch and update this
>>     thread.
>>     >>
>>     >> --
>>     >> Thanks,
>>     >> Amit
>>     >>
>>     >> On Wed, May 2, 2018 at 3:42 PM, Fabian Hueske <fhueske@xxxxxxxxx
>>     <mailto:fhueske@xxxxxxxxx>> wrote:
>>     >> > Hi Amit,
>>     >> >
>>     >> > We recently fixed a bug in the network stack that affected
>>     batch jobs
>>     >> > (FLINK-9144).
>>     >> > The fix was added after your commit.
>>     >> >
>>     >> > Do you have a chance to build the current release-1.5 branch
>>     and check
>>     >> > if
>>     >> > the fix also resolves your problem?
>>     >> >
>>     >> > Otherwise it would be great if you could open a blocker issue
>>     for the
>>     >> > 1.5
>>     >> > release to ensure that this is fixed.
>>     >> >
>>     >> > Thanks,
>>     >> > Fabian
>>     >> >
>>     >> > 2018-04-29 18:30 GMT+02:00 Amit Jain <aj2011it@xxxxxxxxx
>>     <mailto:aj2011it@xxxxxxxxx>>:
>>     >> >>
>>     >> >> Cluster is running on commit 2af481a
>>     >> >>
>>     >> >> On Sun, Apr 29, 2018 at 9:59 PM, Amit Jain <aj2011it@xxxxxxxxx
>>     <mailto:aj2011it@xxxxxxxxx>> wrote:
>>     >> >> > Hi,
>>     >> >> >
>>     >> >> > We are running numbers of batch jobs in Flink 1.5 cluster
>>     and few of
>>     >> >> > those
>>     >> >> > are getting stuck at random. These jobs having the following
>>     failure
>>     >> >> > after
>>     >> >> > which operator status changes to CANCELED and stuck to same.
>>     >> >> >
>>     >> >> > Please find complete TM's log at
>>     >> >> >
>>     https://gist.github.com/imamitjain/066d0e99990ee24f2da1ddc83eba2012
>>     <https://gist.github.com/imamitjain/066d0e99990ee24f2da1ddc83eba2012>
>>     >> >> >
>>     >> >> >
>>     >> >> > 2018-04-29 14:57:24,437 INFO
>>     >> >> > org.apache.flink.runtime.taskmanager.Task
>>     >> >> > - Producer 98f5976716234236dc69fb0e82a0cc34 of partition
>>     >> >> > 5b8dd5ac2df14266080a8e049defbe38 disposed. Cancelling execution.
>>     >> >> >
>>     >> >> >
>>     org.apache.flink.runtime.jobmanager.PartitionProducerDisposedException:
>>     >> >> > Execution 98f5976716234236dc69fb0e82a0cc34 producing partition
>>     >> >> > 5b8dd5ac2df14266080a8e049defbe38 has already been disposed.
>>     >> >> > at
>>     >> >> >
>>     >> >> >
>>     >> >> >
>>     org.apache.flink.runtime.jobmaster.JobMaster.requestPartitionState(JobMaster.java:610)
>>     >> >> > at sun.reflect.GeneratedMethodAccessor107.invoke(Unknown Source)
>>     >> >> > at
>>     >> >> >
>>     >> >> >
>>     >> >> >
>>     sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>     >> >> > at java.lang.reflect.Method.invoke(Method.java:498)
>>     >> >> > at
>>     >> >> >
>>     >> >> >
>>     >> >> >
>>     org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:210)
>>     >> >> > at
>>     >> >> >
>>     >> >> >
>>     >> >> >
>>     org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:154)
>>     >> >> > at
>>     >> >> >
>>     >> >> >
>>     >> >> >
>>     org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleMessage(FencedAkkaRpcActor.java:69)
>>     >> >> > at
>>     >> >> >
>>     >> >> >
>>     >> >> >
>>     org.apache.flink.runtime.rpc.akka.AkkaRpcActor.lambda$onReceive$1(AkkaRpcActor.java:132)
>>     >> >> > at
>>     >> >> >
>>     >> >> >
>>     akka.actor.ActorCell$$anonfun$become$1.applyOrElse(ActorCell.scala:544)
>>     >> >> > at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
>>     >> >> > at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
>>     >> >> > at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
>>     >> >> > at akka.actor.ActorCell.invoke(ActorCell.scala:495)
>>     >> >> > at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
>>     >> >> > at akka.dispatch.Mailbox.run(Mailbox.scala:224)
>>     >> >> > at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
>>     >> >> > at
>>     >> >> >
>>     scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>>     >> >> > at
>>     >> >> >
>>     >> >> >
>>     >> >> >
>>     scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>>     >> >> > at
>>     >> >> >
>>     >> >> >
>>     scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>>     >> >> > at
>>     >> >> >
>>     >> >> >
>>     >> >> >
>>     scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>>     >> >> >
>>     >> >> >
>>     >> >> > Thanks
>>     >> >> > Amit
>>     >> >
>>     >> >
>>     >
>>     >
>>
>>
>
> --
> Nico Kruber | Software Engineer
> data Artisans
>
> Follow us @dataArtisans
> --
> Join Flink Forward - The Apache Flink Conference
> Stream Processing | Event Driven | Real Time
> --
> Data Artisans GmbH | Stresemannstr. 121A,10963 Berlin, Germany
> data Artisans, Inc. | 1161 Mission Street, San Francisco, CA-94103, USA
> --
> Data Artisans GmbH
> Registered at Amtsgericht Charlottenburg: HRB 158244 B
> Managing Directors: Dr. Kostas Tzoumas, Dr. Stephan Ewen
>

Attachment: canceled_job.tar.gz
Description: application/gzip