Re: Setting an allowable number of checkpoint failures


Hi Lakshmi,

you could approximate the described behaviour by calling
CheckpointConfig#setFailOnCheckpointingErrors(true) and using the
FailureRateRestartStrategy as the restart strategy. That way a checkpoint
failure triggers a job restart (this is the downside), which is then handled
by the restart strategy. The FailureRateRestartStrategy allows x failures
within a given time interval. If this number is exceeded, the job fails
terminally.
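
A minimal sketch of that setup, assuming the Flink 1.5/1.6 DataStream API. The checkpoint interval, the failure threshold (3 failures per 5 minutes), the restart delay, the class name and the placeholder pipeline are illustrative assumptions, not values taken from this thread:

```java
import java.util.concurrent.TimeUnit;

import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// Hypothetical job class, used only to show where the settings go.
public class CheckpointFailureRateJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
            StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpoint every 60 seconds (assumed interval).
        env.enableCheckpointing(60_000L);

        // A failed checkpoint fails the task, which triggers a job restart.
        env.getCheckpointConfig().setFailOnCheckpointingErrors(true);

        // Tolerate up to 3 failures per 5 minutes, waiting 10 seconds
        // between restarts; beyond that rate the job fails terminally.
        env.setRestartStrategy(RestartStrategies.failureRateRestart(
            3,
            Time.of(5, TimeUnit.MINUTES),
            Time.of(10, TimeUnit.SECONDS)));

        // Placeholder pipeline so the job has something to run.
        env.fromElements(1, 2, 3).print();

        env.execute("checkpoint-failure-rate-example");
    }
}
```

Note that the restart strategy counts every job failure, not only failures caused by checkpoints, so the threshold applies to all failures combined.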

Cheers,
Till
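
For illustration, the "x failures within a given time interval" rule can be modelled with a small sliding window. This is a stand-alone sketch with a hypothetical class name (FailureRateWindow), not Flink's implementation. Timestamps are passed in explicitly so the behaviour is easy to follow:

```java
import java.util.ArrayDeque;
import java.util.Deque;

/**
 * Stand-alone model of a failure-rate policy: tolerate at most maxFailures
 * failures inside any sliding window of windowMillis milliseconds.
 * Illustration only; this is not Flink's implementation.
 */
class FailureRateWindow {
    private final int maxFailures;
    private final long windowMillis;
    // Timestamps of the failures that are still inside the window.
    private final Deque<Long> failureTimes = new ArrayDeque<>();

    FailureRateWindow(int maxFailures, long windowMillis) {
        if (maxFailures < 1 || windowMillis < 1) {
            throw new IllegalArgumentException("arguments must be positive");
        }
        this.maxFailures = maxFailures;
        this.windowMillis = windowMillis;
    }

    /**
     * Records a failure at nowMillis.
     * Returns true if the failure is still within the allowed rate (restart),
     * false if the rate has been exceeded (fail terminally).
     */
    boolean recordFailure(long nowMillis) {
        // Forget failures that have fallen out of the window.
        while (!failureTimes.isEmpty()
                && nowMillis - failureTimes.peekFirst() >= windowMillis) {
            failureTimes.pollFirst();
        }
        failureTimes.addLast(nowMillis);
        return failureTimes.size() <= maxFailures;
    }
}
```

With maxFailures = 3 and a one-second window, three failures in quick succession are tolerated and a fourth one inside the same window is not. Failures spaced further apart than the window never accumulate.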

On Sat, Aug 4, 2018 at 4:58 AM vino yang <yanghua1127@xxxxxxxxx> wrote:

> Hi Lakshmi,
>
> Your understanding of
> *CheckpointConfig#setFailOnCheckpointingErrors(false)* is correct. If this
> is set to false, the task will only decline the checkpoint and continue
> running.
>
> I think allowing a configurable number of failures would be a good option.
> Flink currently only lets you choose whether or not the task fails when a
> checkpoint fails. Configuring a threshold is not supported.
>
> You can create an issue in JIRA to request this feature.
>
> Thanks, vino.
>
> 2018-08-04 4:28 GMT+08:00 Lakshmi Gururaja Rao <lrao@xxxxxxxx>:
>
> > Hi,
> >
> > We are running into intermittent checkpoint failures while checkpointing
> > to S3.
> >
> > As described in this thread -
> > http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/1-5-some-thing-weird-td21309.html,
> > we see that the job restarts when it encounters such a failure.
> >
> > As mentioned in the thread, I see that there is an option to not fail
> > tasks on checkpoint errors -
> > *CheckpointConfig#setFailOnCheckpointingErrors(false)*. However, this
> > would mean that the job would continue running even in the case of
> > persistent checkpoint failures. Is my understanding here correct?
> >
> > If the above is true, is there a way to configure an allowable number of
> > checkpoint failures? I.e. something along the lines of "Don't fail the
> > job if there are <=X checkpoint failures", so that *only* transient
> > failures can be ignored.
> >
> > Thanks,
> > Lakshmi
> >
>

