Re: 'Spool' Node support


Thanks Julian for the comment.

To some extent the plan does get "split" in VolcanoPlanner. Let's revisit the
example from my earlier message:

TableSink1(on columns c1, c2)
    Project(c1, random() as c2, c3)
        TableScan
TableSink2(on columns c3, c2)
    Project(c1, random() as c2, c3)
        TableScan

The two Projects share the same digest, so VolcanoPlanner recognizes them as
one node, and the plan in the memo looks like:

TableSink1          TableSink2
         \          /
   Project(c1, random() as c2, c3)
              |
          TableScan
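
As a toy illustration of the digest-based sharing above, here is a minimal
sketch in plain Java. The Node and Memo classes are hypothetical stand-ins
that I made up for this example; they are not Calcite's RelNode or
VolcanoPlanner, and the digest format is an assumption. The point is only
that registering a node by digest makes two structurally identical subtrees
resolve to the same object.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: hypothetical classes, not Calcite's RelNode/VolcanoPlanner.
public class DigestMemoSketch {
    /** A plan node identified by a digest built from its description and inputs. */
    static final class Node {
        final String digest;
        final List<Node> inputs;

        Node(String digest, List<Node> inputs) {
            this.digest = digest;
            this.inputs = inputs;
        }
    }

    /** Registers nodes by digest so identical subtrees collapse into one node. */
    static final class Memo {
        private final Map<String, Node> byDigest = new HashMap<>();

        Node register(String description, Node... inputs) {
            // Digest = description plus the digests of all inputs.
            StringBuilder sb = new StringBuilder(description).append('(');
            for (int i = 0; i < inputs.length; i++) {
                if (i > 0) {
                    sb.append(',');
                }
                sb.append(inputs[i].digest);
            }
            String digest = sb.append(')').toString();
            // Reuse the existing node if one with the same digest is registered.
            return byDigest.computeIfAbsent(digest, d -> new Node(d, List.of(inputs)));
        }

        int size() {
            return byDigest.size();
        }
    }

    public static void main(String[] args) {
        Memo memo = new Memo();
        // First branch: TableSink1 <- Project <- TableScan
        Node project1 = memo.register("Project[c1, random() as c2, c3]",
            memo.register("TableScan[src]"));
        memo.register("TableSink1[c1, c2]", project1);
        // Second branch: same Project and TableScan, different sink.
        Node project2 = memo.register("Project[c1, random() as c2, c3]",
            memo.register("TableScan[src]"));
        memo.register("TableSink2[c3, c2]", project2);

        System.out.println(project1 == project2); // prints true: one shared Project
        System.out.println(memo.size());          // prints 4: scan, project, two sinks
    }
}
```

Once a rule rewrites each branch into a Project with a different expression
list, the digests differ and this sharing is lost.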

However, if we "add" a Project here, it will go on to match other rule
patterns (e.g., ProjectMergeRule), and the random Projects will no longer
share the same digest. The new plan looks like:

TableSink1                TableSink2
    |                         |
Project(c1, random())     Project(c3, random())
    |                         |
TableScan(c1)             TableScan(c3)

Notice that the random function call is now executed twice, so the two sinks
no longer see the same c2 values, which breaks the assumption of the original
SQL query.

I think a Spool node should prevent this kind of transformation.
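
To make the intended behavior concrete, here is a rough sketch of the
semantics I have in mind, again in plain Java with made-up names. Spool and
RandomProject are hypothetical and are not existing Calcite classes, and the
three-row scan is invented for the demo. The spool evaluates its input once
and replays the buffered rows to every consumer, so random() runs once no
matter how many sinks read from it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.function.Supplier;

// Toy model only: hypothetical classes, not part of Calcite.
public class SpoolSketch {
    /** Evaluates its input once, then replays the buffered rows to each reader. */
    static final class Spool<T> implements Supplier<List<T>> {
        private final Supplier<List<T>> input;
        private List<T> buffer; // filled on first read; not thread-safe

        Spool(Supplier<List<T>> input) {
            this.input = input;
        }

        @Override
        public List<T> get() {
            if (buffer == null) {
                buffer = List.copyOf(input.get());
            }
            return buffer;
        }
    }

    /** Stand-in for Project(c1, random() as c2, c3) over an assumed 3-row scan. */
    static final class RandomProject implements Supplier<List<double[]>> {
        int evaluations = 0; // how many times the projection was executed
        private final Random random = new Random();

        @Override
        public List<double[]> get() {
            evaluations++;
            List<double[]> rows = new ArrayList<>();
            for (int c1 = 0; c1 < 3; c1++) {
                // Row layout: {c1, c2, c3}
                rows.add(new double[] {c1, random.nextDouble(), c1 * 10});
            }
            return rows;
        }
    }

    public static void main(String[] args) {
        RandomProject project = new RandomProject();
        Spool<double[]> spool = new Spool<>(project);

        List<double[]> sink1Input = spool.get(); // TableSink1 reads c1, c2
        List<double[]> sink2Input = spool.get(); // TableSink2 reads c3, c2

        // Both sinks see the very same rows, so c2 is consistent across them.
        System.out.println(sink1Input == sink2Input); // prints true
        System.out.println(project.evaluations);      // prints 1
    }
}
```

Without the spool, each sink would call the projection itself and get its own
random values, which is the second plan shown above.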

On Tue, Oct 23, 2018 at 11:30 AM Julian Hyde <jhyde@xxxxxxxxxx> wrote:

> I assume you’re talking about HepPlanner? VolcanoPlanner doesn’t “split”
> anything, it only adds new things.
>
> As you’ve noticed Spool isn’t finished, but the idea would be to use
> VolcanoPlanner, because it can truly handle plans that are DAGs, then use
> some kind of costing trick to ensure that nodes that are shared are only
> counted in the overall cost once.
>
> > On Oct 22, 2018, at 8:26 PM, Ted Xu <frankxus@xxxxxxxxx> wrote:
> >
> > Hi folks,
> >
> > I'm not sure if there is a recommended way to represent a diverged
> > (multiple parents) plan in Calcite. It's true that the RelNode data
> > structure is compatible with multiple parents, but this does not work in
> > the optimizer.
> >
> > For example, if we have query as follows,
> >
> > FROM (SELECT c1, random() as c2, c3 FROM src)
> > INSERT OVERWRITE TABLE src1 SELECT c1, c2
> > INSERT OVERWRITE TABLE src2 SELECT c3, c2
> >
> > TableSink1(on columns c1, c2)
> >    Project(c1, random() as c2, c3)
> >        TableScan
> > TableSink2(on columns c3, c2)
> >    Project(c1, random() as c2, c3)
> >        TableScan
> >
> > Planners will recognize that the Projects and TableScans share common
> > digests and merge them, but the project transpose rules split them
> > again, which breaks the random() assumption.
> >
> > My solution is to add a Spool node that prevents any rule from further
> > splitting a sub-plan, but it generates a sub-optimal result. I've noticed
> > there is a really old JIRA ticket,
> > https://jira.apache.org/jira/browse/CALCITE-481, but it was somehow
> > suspended.
> >
> > I'd like to move forward on this feature, but there are still some things
> > to do first:
> >
> > 1. Make RelOptRuleCall aware of parents; currently only HepRelOptRuleCall
> > passes parents, and only in certain cases.
> > 2. Let RelOptRuleOperand define multiple-parent patterns.
> >
> > Please correct me if I've got something wrong; any suggestion will be
> > much appreciated.
>
>