git.net

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [DISCUSS] Unified Core API for Streaming and Batch


Hi Stephan:
I totally agree with you, this discussion covers too many topics, so we can cut it into a series of sub-discussions proposed by you,  firstly we can focus on phrase-1: “What Flink API Stack Should be for a Unified Engine”.
Best,
Feng Wang

On Dec 3, 2018, at 19:36, Stephan Ewen <sewen@xxxxxxxxxx<mailto:sewen@xxxxxxxxxx>> wrote:

Hi all!

This is a great discussion to start and I agree with the idea behind it. We
should get started designing what the Flink stack should look like in the
future.

This discussion is very big, though, and from past experiences if the scope
is too big, the discussions and up falling apart when everyone goes into
different details.
So my suggestion would be to stage this discussion and take this aspect
after aspect, starting with what we want to expose to users and then going
into the internal details.

*Discussion (1) What should the API stack look like*
 - Relationship of DataStream, DataSet, and Table API
 - Where do we want automatic optimization, and where not
 - Future of DataSet (subsumed in data stream, or remains independent)
 - What happens with iterations
 - What happens with the collection execution mode

*Discussion (2) What should the abstractions look like.*
 - This is based on the outcome of (1)
 - Operator DAG
 - Operator Interface
 - what is sent to the REST API when a job is submitted, etc.
 - modules and dependency structure

*Discussion (3) what is special for Batch*
 - I would like to follow the philosophy that "batch allows us to activate
additional optimizations" made possible by the bounded nature of the inputs.
 - special case scheduling
 - additional runtime algorithms (like hybrid hash joins)
 - no watermarks / late data / etc.
 - Special casing in failover (or possibly not, could be still the same
core mechanism

What do you think?

Best,
Stephan



On Mon, Dec 3, 2018 at 12:17 PM Haibo Sun <sunhaibotb@xxxxxxx<mailto:sunhaibotb@xxxxxxx>> wrote:

Thanks, zhijiang.


For the optimization, such as cost-based estimation, we still want to keep
it in the data set layer,
but your suggestion is also a thought that can be considered.


As I know, currently these batch scenarios have been contained in DataSet,
such as
the sort-merge join algorithm. So I think that the unification should
consider such features
as input selection at reading.


Best,
Haibo


At 2018-12-03 16:38:13, "zhijiang" <wangzhijiang999@xxxxxxxxxx.INVALID<mailto:wangzhijiang999@xxxxxxxxxx.INVALID>>
wrote:
Hi haibo,

Thanks for bringing this discussion!

I reviewd the google doc and really like the idea of unifying the stream
and batch in all stacks. Currently only network runtime stack is unified
for both stream and batch jobs, but the compilation, operator and runtime
task stacks are all separate. The stream stack developed frequently and
behaved dominantly these years, but the batch stack was touched less. If
they are unified into one stack, the batch jobs can also get benefits from
all the improvements. I think it is a very big work but worth doing, left
some concerns:

1. The current job graph generation for batch covers complicated
optimization such as cost-based estimate, plan etc. Would this part also be
considered retaining during integrating with stream graph generation?

2. I saw some other special improvements for batch scenarios in the doc,
such as input selection while reading. I acknowledge these roles for
special batch scenarios, but they seem not the blocker for unification
motivation, because current batch jobs can also work without these
improvements. So the further improvments can be separated into individual
topics after we reaching the unification of stream and batch firstly.

Best,
Zhijiang


------------------------------------------------------------------
发件人:孙海波 <sunhaibotb@xxxxxxx<mailto:sunhaibotb@xxxxxxx>>
发送时间:2018年12月3日(星期一) 10:52
收件人:dev <dev@xxxxxxxxxxxxxxxx<mailto:dev@xxxxxxxxxxxxxxxx>>
主 题:[DISCUSS] Unified Core API for Streaming and Batch

Hi all,
This post proposes unified core API for Streaming and Batch.
Currently DataStream and DataSet adopt separated compilation processes,
execution tasks
and basic programming models in the runtime layer, which complicates the
system implementation.
We think that batch jobs can be processed in the same way as streaming
jobs, thus we can unify
the execution stack of DataSet into that of DataStream.  After the
unification the DataSet API will
also be built on top of StreamTransformation, and its basic programming
model will be changed
from "UDF on Driver" to "UDF on StreamOperator". Although the DataSet
operators will need to
implement the interface StreamOperator instead after the unification,
user jobs do not need to change
since DataSet uses the same UDF interfaces as DataStream.

The unification has at least three benefits:
1. The system will be greatly simplified with the same execution stack
for both streaming and batch jobs.
2. It is no longer necessary to implement two sets of Driver(s) (operator
strategies) for batch, namely chained and non-chained.
3. The unified programming model enables streaming and batch jobs to
share the same operator implementation.


The following is the design draft. Any feedback is highly appreciated.

https://docs.google.com/document/d/1G0NUIaaNJvT6CMrNCP6dRXGv88xNhDQqZFrQEuJ0rVU/edit?usp=sharing

Best,
Haibo