
# Re: Are all the statistics given to calcite, need to be exact or approximate?


Can you describe which column statistics must be exact and which can be approximate to get a decent plan?

On 2018/05/04 05:31:30, aishwaryaanns@xxxxxxxxx <aishwaryaanns@xxxxxxxxx> wrote:
> Yes, ColStatistics is in Hive and it holds all statistics about the columns.
>
> On 2018/05/03 16:26:02, Julian Hyde <jhyde@xxxxxxxxxx> wrote:
> > It depends on the statistic. Most of them are approximate.
> >
> > It’s the "garbage in, garbage out" principle. An exact statistic may be of a bit more (or a lot more) use to the consumer of the statistic, but is more effort for the producer of the statistic.
> >
> > RelMdMaxRowCount is one of the few exact ones. If RelMdMaxRowCount says 10, the relation might return 0 rows or 9 rows or 10 rows but never 11 rows.
> >
> > RelMdPredicates is also exact (albeit not numeric). RelMdUniqueKeys is exact (which is to say, if it returns a key, it is definitely unique; there may be some unique keys that it does not know about).
> >
> > I don’t know what ColStatistics is. Is it a Hive thing? I surmise that it is based on RelMdRowCount, which is approximate.
> >
> > Julian
> >
> > > On May 3, 2018, at 5:41 AM, Valli Annamalai <aishwaryaanns@xxxxxxxxx> wrote:
> > >
> > > In Hive, column statistics like countDistinct, isPrimaryKey, etc. need
> > > to be set. While doing so, in Hive, the following function sets primary key
> > > to true based on an assumption.
> > >
> > > public static void inferAndSetPrimaryKey(long numRows,
> > >     List<ColStatistics> colStats) {
> > >   if (colStats != null) {
> > >     for (ColStatistics cs : colStats) {
> > >       if (cs != null && cs.getCountDistint() >= numRows) {
> > >         cs.setPrimaryKey(true);
> > >       }
> > >       else if (cs != null && cs.getRange() != null &&
> > >           cs.getRange().minValue != null &&
> > >           cs.getRange().maxValue != null) {
> > >         if (numRows ==
> > >             ((cs.getRange().maxValue.longValue() -
> > >               cs.getRange().minValue.longValue()) + 1)) {
> > >           cs.setPrimaryKey(true);
> > >         }
> > >       }
> > >     }
> > >   }
> > > }
> > >
> > > If this is the case, and I have only 2 values filled over the
> > > entire column, which are 1 and 1000, and 1000 is the numRows, then having
> > > primary key as true would be wrong. While planning, suppose an aggregation
> > > is the upcoming node: that node could then be skipped, on the assumption that
> > > a primary key column has only unique values.
> > >
> > > If we set the primary key using the assumption in the function above, and
> > > Calcite also proceeds with that assumption, then the result will also be
> > > wrong. So how could this be solved?
> > >
> > > Similarly, for count distinct, is it okay to give approximate values to
> > > Calcite?
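
The counterexample in the original question (1000 rows holding only the values 1 and 1000) can be checked with a small standalone sketch. It does not use Hive or Calcite classes: `PrimaryKeyHeuristicDemo`, `rangeHeuristicSaysKey`, `isActuallyUnique` and `twoValueColumn` are hypothetical names introduced here, and `rangeHeuristicSaysKey` only mirrors the range branch of the quoted `inferAndSetPrimaryKey`.

```java
import java.util.Arrays;

// Standalone illustration; none of these names exist in Hive or Calcite.
final class PrimaryKeyHeuristicDemo {

    // Mirrors the range branch of the quoted function: the column is
    // treated as a key when (max - min + 1) equals the row count.
    public static boolean rangeHeuristicSaysKey(long numRows, long minValue, long maxValue) {
        return numRows == (maxValue - minValue) + 1;
    }

    // Ground truth: the column is a key only if every row holds a different value.
    public static boolean isActuallyUnique(long[] column) {
        return Arrays.stream(column).distinct().count() == column.length;
    }

    // Builds the column described in the thread: only the values 1 and 1000.
    public static long[] twoValueColumn(int numRows) {
        long[] column = new long[numRows];
        for (int i = 0; i < numRows; i++) {
            column[i] = (i % 2 == 0) ? 1L : 1000L;
        }
        return column;
    }

    public static void main(String[] args) {
        long[] column = twoValueColumn(1000);
        long min = Arrays.stream(column).min().getAsLong();
        long max = Arrays.stream(column).max().getAsLong();

        System.out.println("range heuristic says key: "
                + rangeHeuristicSaysKey(column.length, min, max));
        System.out.println("column is actually unique: "
                + isActuallyUnique(column));
    }
}
```

In this sketch the heuristic reports a key while the column holds only two distinct values. That is the situation Julian's description of RelMdUniqueKeys rules out: a key returned there must definitely be unique, so a flag inferred this way would presumably be safe to pass on as a unique key only when the producer knows the underlying statistics are exact.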
