Re: [DISCUSS] Flink backward compatibility


so let's take a look...

binary client compatibility: The key issue I see hasn't changed since the last time this was brought up: clients rely on the JobGraph, which is an internal data structure, to submit the job. AFAIK there will also be changes made to that class soon(ish). As long as we don't introduce a decoupled structure and/or compatibility routines here, this is not feasible. The client in general may also be in the way. The unfortunate reality is that the client code is one big mess that is due for a complete rewrite. I doubt anyone has an all-encompassing view of the hidden assumptions baked into it, which we would have to retain if we go for backwards compatibility.
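To make "decoupled structure and/or compatibility routines" a bit more concrete, here is a rough sketch of the general idea, not a proposal for actual Flink code. Everything in it is hypothetical: the JSON descriptor, the field names (format_version, job_name, jar, jars) and the function names are made up for illustration. The point is only that the client would send a versioned, stable format, and the cluster would upgrade older versions before translating them into whatever its internal structure currently looks like.

```python
import json

# Hypothetical wire format: none of these names exist in Flink.
SUPPORTED_VERSIONS = {1, 2}


def upgrade_v1_to_v2(desc):
    """Compatibility routine: v1 carried a single 'jar', v2 carries a list 'jars'."""
    upgraded = {k: v for k, v in desc.items() if k != "jar"}
    upgraded["format_version"] = 2
    upgraded["jars"] = [desc["jar"]]
    return upgraded


def read_descriptor(payload):
    """Parse a submitted descriptor and normalize it to the newest known version."""
    desc = json.loads(payload)
    version = desc.get("format_version")
    if version not in SUPPORTED_VERSIONS:
        raise ValueError(f"unsupported descriptor version: {version!r}")
    if version == 1:
        desc = upgrade_v1_to_v2(desc)
    return desc
```

With something like this, an old client that still sends a version 1 descriptor keeps working after the cluster moves to version 2, while the internal structure behind read_descriptor stays free to change.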

CLI compatibility: Does this include all start scripts or just the flink executable? I think this makes sense; so far we have done a reasonable job of not changing command-line parameters. (But maybe only because changing this part of the CLI is a massive pain...)

REST API: The versioning introduced in 1.7.0 is a significant step towards a stable API as it allows us to modify things without (inherently) breaking it. We're primarily missing tests here to verify the stability, but these are being worked on.
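For tooling authors, the practical effect of the versioning is that a client can pin the API version it was written against by prefixing the path with v<version>. A minimal sketch of building such URLs follows; the helper name versioned_url is made up, and the host and port (localhost:8081) and the /jobs/overview path are only example values, so adjust them to your setup.

```python
def versioned_url(base_url, path, version=None):
    """Build a REST URL, optionally pinned to an API version via a /v<N> prefix.

    Hypothetical helper for illustration; it only builds the string and
    performs no request.
    """
    base = base_url.rstrip("/")
    suffix = "/" + path.lstrip("/")
    if version is None:
        # Unversioned path: the server decides which version answers.
        return base + suffix
    if not isinstance(version, int) or version < 1:
        raise ValueError(f"version must be a positive integer, got {version!r}")
    return f"{base}/v{version}{suffix}"


# Example values only.
pinned = versioned_url("http://localhost:8081", "/jobs/overview", version=1)
unpinned = versioned_url("http://localhost:8081", "/jobs/overview")
```

Tooling that always uses the pinned form should not be affected when a later release adds a newer API version next to the old one.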

Metrics: I would not categorize them as stable in general, because we are still refactoring and streamlining their usage. For some core system metrics (checkpoint info, IO) we can _probably_ guarantee stability.
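If we did commit to a stable core set, the guarantee could be checked mechanically: every metric name in the stable set of the old release must still exist in the new one. A minimal sketch, assuming the metric names of each release are available as plain lists of strings; the function name and the names in the example are illustrative only, not an agreed-upon stable set.

```python
def missing_metrics(stable_old, available_new):
    """Return the stable metric names from the old release that the new one lacks."""
    return sorted(set(stable_old) - set(available_new))


# Illustrative names only; not an actual list of guaranteed metrics.
stable_in_old = ["checkpointDuration", "bytesIn", "bytesOut"]
available_in_new = ["checkpointDuration", "bytesIn", "recordsIn"]

removed = missing_metrics(stable_in_old, available_in_new)
if removed:
    print("stable metrics no longer available:", removed)
```

A release check would fail whenever the returned list is non-empty; metrics added in the new release are not a compatibility problem and are ignored.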

On 27.11.2018 18:43, Thomas Weise wrote:
Some scenarios that come to mind:

Flink client binary compatibility with remote cluster: This would include
RemoteStreamEnvironment, RestClusterClient etc. A user should be able to
submit a job built with 1.6.x, using the 1.6.x binaries, to a remote
Flink 1.7.x or later cluster. The use case for this is Beam.

REST API compatibility: User tooling built against the 1.6.x REST API spec
continues to work with the 1.7.x or later REST API.

CLI compatibility: The commands/options exposed in the CLI continue to be
available after an upgrade. Users can just point to the new CLI location.

Metrics: Metrics that exist in 1.6.x are available in 1.7.x.

There is probably a lot more (such as various backends that users can
configure and their options) and there are different levels of
cost/complexity trade-offs. I brought up the REST API in the past after
observing the tooling breakage when going from 1.4.x to 1.5.x.

The client binary compatibility issue will grow more severe as the
ecosystem expands. Beam is a representative example in that category. To
solve the issue downstream, different communities and users each would need
to come up with build system/release support for multiple parallel Flink
versions. It would be better to shield them from such complexity.

Thanks,
Thomas


On Tue, Nov 27, 2018 at 6:27 AM Fabian Hueske <fhueske@xxxxxxxxx> wrote:

Hi,

I think this is a very good discussion to have.
Flink is becoming part of more and more production deployments and more
tools are built around it.
The question is whether we want to (or can) make parts of the
control/maintenance/monitoring API stable so that external
systems/frameworks can rely on them.

Which APIs are relevant?
Which APIs could be declared as stable?
Which parts are still evolving?

Fabian

On Tue, Nov 27, 2018 at 3:10 PM Chesnay Schepler <chesnay@xxxxxxxxxx> wrote:

I think this discussion needs specific examples of what should be
possible, as it is otherwise too vague / open to interpretation.

For example, "job submission" may refer to CLI invocations continuing to
work (i.e. CLI arguments), or being able to use a 1.6 client against a
1.7 cluster, which are entirely different things.

What does "management" include? Dependencies? The set of jars that are
released on Maven? The set of jars bundled with flink-dist?

On 26.11.2018 17:24, Thomas Weise wrote:
Hi,

I wanted to bring back the topic of backward compatibility with respect to
all/most of the user-facing aspects of Flink. Please note that this isn't
limited to the programming API, but also includes job submission and
management.

As can be seen in [1], changes in these areas cause difficulties
downstream. Projects have to choose between Flink versions and users are
ultimately at a disadvantage, either by not being able to use the desired
dependency or facing forced upgrades to their infrastructure.

IMO the preferred solution would be that downstream projects can build
against a minimum version of Flink and expect compatibility with future
releases of the major version stream. For example, my project depends on
1.6.x and can expect to run without recompilation on 1.7.x and later.

How far away is Flink from stabilizing the surface that affects typical
users?

Thanks,
Thomas

[1] https://issues.apache.org/jira/browse/BEAM-5419