
Re: SQL Filter Pushdowns in Apache Beam SQL

This has come up in a couple of in-person conversations. Pushing filtering and projection into connectors is something we intend to do. Calcite's optimizer is designed to support this; we just haven't set it up yet.

Your use case sounds like one that might test the limits of that, since the JDBC read would occur before windowing or setting it up as a side input. I'd be curious what a Beam pipeline to do this without SQL would look like.
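To make the side-input idea concrete, here is a minimal plain-Java sketch of the join shape being discussed. It deliberately avoids the Beam API so it runs standalone; the map stands in for a JDBC read materialized as a map side input, and all names are hypothetical:

```java
import java.util.*;
import java.util.stream.*;

public class SideInputJoinSketch {
    // Hypothetical stand-ins: 'reference' plays the role of a JDBC read
    // materialized as a map side input; 'windowKeys' are the keys of the
    // unbounded elements that arrived in one window.
    static List<String> join(Map<String, String> reference, List<String> windowKeys) {
        // Each element looks up its key in the map, which is roughly what a
        // ParDo with a map-valued side input does in a real Beam pipeline.
        return windowKeys.stream()
                .map(k -> k + " -> " + reference.getOrDefault(k, "unknown"))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, String> reference = new HashMap<>();
        reference.put("A", "Alpha");
        reference.put("B", "Beta");
        System.out.println(join(reference, Arrays.asList("A", "B", "C")));
        // prints: [A -> Alpha, B -> Beta, C -> unknown]
    }
}
```

Note that in this shape the whole reference table is loaded before the join, which is exactly why the JDBC read happening before windowing limits what can be pushed down.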


On Wed, Jun 13, 2018 at 8:47 AM Lukasz Cwik <lcwik@xxxxxxxxxx> wrote:
It is currently the latter, where all the data is read and then filtered within the pipeline. Note that this doesn't mean all the data is loaded into memory, since how the join is performed depends on the runner powering the pipeline.

Kenn had shared this doc[1], which is starting to look at integrating runners and IO into the SQL shell and at defining a way to map properties from SQL onto the IO connector. It seems natural that the filter would get pushed down to the IO connector as well. Please take a look and feel free to comment.

On Wed, Jun 13, 2018 at 7:39 AM Harshvardhan Agrawal <harshvardhan.agr93@xxxxxxxxx> wrote:

We are currently playing with Apache Beam’s SQL extension on top of Flink. One of the features we are interested in is the SQL predicate pushdown feature that Spark provides. Does Beam support that?

For example:
I have an unbounded dataset that I want to join with some static reference data stored in a database. Will Beam perform the logic of figuring out all the unique keys in the window and push them down to the JDBC source, or will it bring all the data from the JDBC source into memory and then perform the join?
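For illustration, the "push the unique keys down" half of this question can be sketched in plain Java: collect the distinct keys seen in a window and build a parameterized query so the database only returns the matching reference rows. This is a conceptual sketch, not Beam or Calcite behavior, and the helper name is hypothetical:

```java
import java.util.*;
import java.util.stream.*;

public class PushdownQuerySketch {
    // Hypothetical helper: given the distinct keys observed in a window,
    // build a parameterized JDBC query so only matching reference rows
    // are read, instead of scanning the whole table.
    static String pushdownQuery(String table, String keyColumn, Collection<String> keys) {
        String placeholders = keys.stream()
                                  .map(k -> "?")
                                  .collect(Collectors.joining(", "));
        return "SELECT * FROM " + table
             + " WHERE " + keyColumn + " IN (" + placeholders + ")";
    }

    public static void main(String[] args) {
        // Keys gathered from one window of the unbounded stream.
        Set<String> windowKeys = new LinkedHashSet<>(Arrays.asList("k1", "k2", "k3"));
        System.out.println(pushdownQuery("reference_data", "id", windowKeys));
        // prints: SELECT * FROM reference_data WHERE id IN (?, ?, ?)
    }
}
```

The keys would then be bound to the placeholders via a PreparedStatement; the point is that the filter travels to the source rather than being applied after a full read.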