
[DISCUSS] Integrate Flink SQL well with Hive ecosystem


Hi all,

Along with the community's effort, inside Alibaba we have explored Flink's potential as an execution engine not just for stream processing but also for batch processing. We are encouraged by our findings and have begun working to make Flink's SQL capabilities full-fledged. When comparing what's available in Flink to the offerings of competing data processing engines, we identified a major gap in Flink: the lack of good integration with the Hive ecosystem. This is crucial to the success of Flink SQL and batch because of the well-established data ecosystem around Hive. We have therefore done some initial work in this direction, but a lot of effort is still needed.

We have two strategies in mind. The first is to make Flink SQL full-fledged and well integrated with the Hive ecosystem, similar to the approach Spark SQL adopted. The second is to make Hive itself work with Flink, similar to the proposal in [1]. Each approach has its pros and cons, but they don’t need to be mutually exclusive, since each targets different users and use cases. We believe that both will promote much greater adoption of Flink beyond stream processing.

We have been focused on the first approach and would like to showcase Flink's batch and SQL capabilities with Flink SQL. However, we also plan to start on strategy #2 as a follow-up effort.

I'm completely new to Flink (a short bio [2] is below), though many of my colleagues here at Alibaba are long-time contributors. Nevertheless, I'd like to share our thoughts and invite your early feedback. At the same time, I am working on a detailed proposal on Flink SQL's integration with the Hive ecosystem, which will also be shared when ready.

While the ideas are simple, each approach will demand significant effort, more than we can afford on our own. Thus, input and contributions from the communities are very welcome and greatly appreciated.

Regards,


Xuefu

References:

[2] Xuefu Zhang is a long-time open source veteran who has worked or is working on many projects under the Apache Foundation, of which he is also an honored member. About 10 years ago he worked on the Hadoop team at Yahoo, where the projects had just gotten started. Later he worked at Cloudera, initiating and leading the development of the Hive on Spark project in the communities and across many organizations. Prior to joining Alibaba, he worked at Uber, where he promoted Hive on Spark for all of Uber's SQL on Hadoop workloads and significantly improved Uber's cluster efficiency.