Re: Setting an allowable number of checkpoint failures


Hi Lakshmi,

Your understanding of
*CheckpointConfig#setFailOnCheckpointingErrors(false)* is correct. If this
is set to false, the task will only decline the checkpoint and continue
running.

I also think it would be useful to be able to set an allowable number of
failures. Currently, Flink only lets you choose whether the task fails when
a checkpoint fails. Configuring a threshold is not supported.
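
To make the requested behavior concrete, here is a minimal sketch in plain
Java. It is not Flink code: CheckpointFailureTracker and its methods are
hypothetical names used only for illustration. It also assumes that
*consecutive* failures are what should be counted, so that a successful
checkpoint resets the count and only persistent failures exceed the limit.

```java
/**
 * Hypothetical helper, NOT part of Flink: counts consecutive checkpoint
 * failures and reports when a configured tolerance has been exceeded.
 */
final class CheckpointFailureTracker {
    private final int tolerableFailures;
    private int consecutiveFailures;

    CheckpointFailureTracker(int tolerableFailures) {
        if (tolerableFailures < 0) {
            throw new IllegalArgumentException("tolerableFailures must be >= 0");
        }
        this.tolerableFailures = tolerableFailures;
    }

    /** A completed checkpoint ends the current streak of failures. */
    void onCheckpointSuccess() {
        consecutiveFailures = 0;
    }

    /**
     * Records one failed checkpoint.
     *
     * @return true if the streak now exceeds the tolerance, i.e. the caller
     *         should fail the job; false if the failure is still tolerated
     */
    boolean onCheckpointFailure() {
        consecutiveFailures++;
        return consecutiveFailures > tolerableFailures;
    }

    int getConsecutiveFailures() {
        return consecutiveFailures;
    }

    public static void main(String[] args) {
        // Tolerate up to 2 consecutive failures (assumed value for the demo).
        CheckpointFailureTracker tracker = new CheckpointFailureTracker(2);
        // true = checkpoint completed, false = checkpoint failed
        boolean[] outcomes = {true, false, false, true, false, false, false};
        for (boolean completed : outcomes) {
            if (completed) {
                tracker.onCheckpointSuccess();
                System.out.println("checkpoint completed");
            } else if (tracker.onCheckpointFailure()) {
                System.out.println("tolerance exceeded, job should fail");
            } else {
                System.out.println("checkpoint failed, tolerated");
            }
        }
    }
}
```

With a tolerance of 2, two failures in a row are ignored and a success in
between resets the count. Only a third consecutive failure signals that the
job should fail. Counting total failures instead of consecutive ones would
be an equally valid reading of the request.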

You can create an issue in JIRA to request this feature.

Thanks, vino.

2018-08-04 4:28 GMT+08:00 Lakshmi Gururaja Rao <lrao@xxxxxxxx>:

> Hi,
>
> We are running into intermittent checkpoint failures while checkpointing to
> S3.
>
> As described in this thread -
> <http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/1-5-some-thing-weird-td21309.html>,
> we see that the job restarts when it encounters such a failure.
>
> As mentioned in the thread, I see that there is an option to not fail tasks
> on checkpoint errors -
> *CheckpointConfig#setFailOnCheckpointingErrors(false)*. However, this
> would mean that the job would continue running even in the case of
> persistent checkpoint failures. Is my understanding here correct?
>
> If the above is true, is there a way to configure an allowable number of
> checkpoint failures? i.e. something along the lines of "Don't fail the job
> if there are <=X checkpoint failures", so that *only* transient
> failures can be ignored.
>
> Thanks,
> Lakshmi
>