Re: Setting an allowable number of checkpoint failures


Hi Till,

I think the approach you proposed works. But I think we could also provide
a way to keep checkpoints from failing indefinitely in the case where the
job itself does not fail.
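
To make the failure-rate idea from Till's mail (quoted below) concrete, here is a minimal sketch of "allow x failures within a time interval". It is plain Java and not Flink's FailureRateRestartStrategy implementation; the class name FailureRateWindow and its method are hypothetical, made up for illustration only.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical sketch of a failure-rate check; not Flink's own class. */
final class FailureRateWindow {
    private final int maxFailuresPerInterval;
    private final long intervalMillis;
    // Timestamps of recent failures, oldest first.
    private final Deque<Long> failureTimestamps = new ArrayDeque<>();

    FailureRateWindow(int maxFailuresPerInterval, long intervalMillis) {
        this.maxFailuresPerInterval = maxFailuresPerInterval;
        this.intervalMillis = intervalMillis;
    }

    /**
     * Records a failure at nowMillis and returns true if a restart is
     * still allowed, false if the allowed rate has been exceeded.
     */
    boolean recordFailureAndCheckRestart(long nowMillis) {
        // Forget failures that are older than the interval.
        while (!failureTimestamps.isEmpty()
                && nowMillis - failureTimestamps.peekFirst() > intervalMillis) {
            failureTimestamps.removeFirst();
        }
        failureTimestamps.addLast(nowMillis);
        return failureTimestamps.size() <= maxFailuresPerInterval;
    }
}
```

With a limit of 2 failures per 1000 ms, a third failure inside the same second reports that no further restart is allowed, while failures spread further apart keep being tolerated.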

Instead, a configurable threshold would allow checkpoints to fail a few
times. Once this threshold is reached, the job fails.
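
A rough sketch of what such a threshold could look like, in plain Java. This is not an existing Flink API; the class CheckpointFailureCounter and its methods are hypothetical. It also assumes two things the thread does not settle: that only consecutive failures count (a successful checkpoint resets the counter), and that the job fails once the number of failures goes above the tolerated value X.

```java
/** Hypothetical sketch of a checkpoint failure threshold; not a Flink class. */
final class CheckpointFailureCounter {
    private final int tolerableFailures;
    private int consecutiveFailures;

    CheckpointFailureCounter(int tolerableFailures) {
        if (tolerableFailures < 0) {
            throw new IllegalArgumentException("tolerableFailures must be >= 0");
        }
        this.tolerableFailures = tolerableFailures;
    }

    /** Assumption: a successful checkpoint resets the failure count. */
    void onCheckpointSuccess() {
        consecutiveFailures = 0;
    }

    /**
     * Counts one failed checkpoint and returns true if the job should
     * now fail, i.e. more than tolerableFailures failed in a row.
     */
    boolean onCheckpointFailure() {
        consecutiveFailures++;
        return consecutiveFailures > tolerableFailures;
    }
}
```

With a tolerance of 2, two failed checkpoints followed by a successful one leave the job running, and only a third failure in a row would make it fail.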

Thanks, vino.

2018-08-06 15:14 GMT+08:00 Till Rohrmann <trohrmann@xxxxxxxxxx>:

> Hi Lakshmi,
>
> you could somewhat achieve the described behaviour by setting
> setFailOnCheckpointingErrors(true) and using the FailureRateRestartStrategy
> as the restart strategy. That way checkpoint failures will trigger a job
> restart (this is the downside) which is handled by the restart strategy.
> The FailureRateRestartStrategy allows for x failures to happen within a
> given time interval. If this number is exceeded, then the job will
> terminally fail.
>
> Cheers,
> Till
>
> On Sat, Aug 4, 2018 at 4:58 AM vino yang <yanghua1127@xxxxxxxxx> wrote:
>
> > Hi Lakshmi,
> >
> > Your understanding of
> > CheckpointConfig#setFailOnCheckpointingErrors(false) is correct. If this
> > is set to false, the task will only decline the checkpoint and continue
> > running.
> >
> > I also think allowing a configurable number of failures is a good idea.
> > Flink currently only lets you choose whether the task fails when a
> > checkpoint fails. Configuring a threshold is not supported.
> >
> > You can create a JIRA issue to request this feature.
> >
> > Thanks, vino.
> >
> > 2018-08-04 4:28 GMT+08:00 Lakshmi Gururaja Rao <lrao@xxxxxxxx>:
> >
> > > Hi,
> > >
> > > We are running into intermittent checkpoint failures while checkpointing
> > > to S3.
> > >
> > > As described in this thread -
> > > <http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/1-5-some-thing-weird-td21309.html>,
> > > we see that the job restarts when it encounters such a failure.
> > >
> > > As mentioned in the thread, I see that there is an option to not fail tasks
> > > on checkpoint errors -
> > > CheckpointConfig#setFailOnCheckpointingErrors(false). However, this
> > > would mean that the job would continue running even in the case of
> > > persistent checkpoint failures. Is my understanding here correct?
> > >
> > > If the above is true, then is there a way to configure an allowable number
> > > of checkpoint failures? i.e. something along the lines of "Don't fail the
> > > job if there are <=X number of checkpoint failures", so that *only* transient
> > > failures can be ignored.
> > >
> > > Thanks,
> > > Lakshmi
> > >
> >
>