Re: hadoopInputFormat and elasticsearch


Hi,

At the moment, if processing of any input split fails, Flink restarts
the whole batch job from scratch, so splits that had already finished
are read again.

There is an ongoing effort to improve fine-grained recovery in FLINK-4256.
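
To illustrate the full-restart behaviour described above, here is a toy
simulation in plain Java. It does not use Flink or elasticsearch-hadoop:
the class FullRestartSketch and the method totalSplitReads are made up
for illustration, and the splits are processed one after another here,
whereas a real job reads them in parallel. It only counts how many split
reads happen when one failure forces a restart from split 0.

```java
class FullRestartSketch {

    /**
     * Simulates a batch job over numSplits input splits.
     * The first attempt fails when it reaches failingSplit (0-based);
     * later attempts do not fail. Returns the total number of split
     * reads across all attempts.
     */
    static int totalSplitReads(int numSplits, int failingSplit) {
        int reads = 0;
        boolean failureInjected = false;
        while (true) {
            boolean failed = false;
            // Every attempt starts at split 0: nothing records which
            // splits finished in the previous attempt.
            for (int split = 0; split < numSplits; split++) {
                if (!failureInjected && split == failingSplit) {
                    failureInjected = true;
                    failed = true;
                    break; // whole job is cancelled and restarted
                }
                reads++;
            }
            if (!failed) {
                return reads;
            }
        }
    }

    public static void main(String[] args) {
        // 20 splits (one per shard), failure when reaching split 10.
        System.out.println(totalSplitReads(20, 10)); // prints 30
    }
}
```

With 20 splits and a failure halfway through, the 10 splits that had
already finished are read a second time, so the job performs 30 split
reads in total instead of 20.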

Best,
Andrey

> On 2 Oct 2018, at 13:52, aviad <rotem.aviad@xxxxxxxxx> wrote:
> 
> Hi,
> 
> I want to write a batch job that reads data from *elasticsearch* using
> *elasticsearch-hadoop* (https://github.com/elastic/elasticsearch-hadoop/)
> and *hadoopInputFormat*.
> 
> example code (from
> https://github.com/genged/flink-playground/blob/master/src/main/java/com/mic/flink/FlinkMain.java):
> 
> 
> 
> elasticsearch-hadoop creates one Hadoop InputSplit (task) per Elasticsearch
> shard, so if my index has 20 shards, it will be split into 20 InputSplits.
> 
> 
> /My question is:/
> What happens if my job restarts (failover) after finishing half of the
> InputSplits?
> Does hadoopInputFormat remember which InputSplits are finished and know how
> to continue from where it stopped (maybe by reading from the beginning of
> each unfinished InputSplit), or does it start from the beginning?
> 
> thanks