
# Are all the statistics given to calcite, need to be exact or approximate?


In Hive, column statistics such as countDistinct and isPrimaryKey need to be set. When doing so, Hive uses the following function, which sets the primary key flag to true based on an assumption:

    public static void inferAndSetPrimaryKey(long numRows, List<ColStatistics> colStats) {
      if (colStats != null) {
        for (ColStatistics cs : colStats) {
          if (cs != null && cs.getCountDistint() >= numRows) {
            cs.setPrimaryKey(true);
          } else if (cs != null && cs.getRange() != null && cs.getRange().minValue != null
              && cs.getRange().maxValue != null) {
            if (numRows ==
                ((cs.getRange().maxValue.longValue() - cs.getRange().minValue.longValue()) + 1)) {
              cs.setPrimaryKey(true);
            }
          }
        }
      }
    }

Suppose a column contains only two distinct values, 1 and 1000, and numRows is 1000. The count-distinct check fails (2 < 1000), but the range check passes (1000 - 1 + 1 == 1000), so the column is marked as a primary key, which is wrong.

During planning, if the next node is an aggregation, that node could be skipped on the grounds that a primary key column holds only unique values. If Hive sets the primary key flag using the function above and Calcite plans on the basis of that flag, the query result will also be wrong. How can this be solved?

Similarly, for count distinct, is it okay to give approximate values to Calcite?
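To make the concern concrete, here is a minimal, self-contained sketch of the counterexample. It does not use Hive or Calcite classes: `RangeHeuristicDemo`, `rangeSuggestsPrimaryKey`, `isActuallyUnique`, and `sampleColumn` are hypothetical names introduced only for illustration, and the first method merely mirrors the range comparison from the function quoted above.

```java
import java.util.HashSet;
import java.util.Set;

public class RangeHeuristicDemo {

    // Hypothetical stand-in for the range branch of inferAndSetPrimaryKey:
    // the column is treated as a key when numRows == (max - min) + 1.
    static boolean rangeSuggestsPrimaryKey(long numRows, long min, long max) {
        return numRows == (max - min) + 1;
    }

    // Ground truth: a column can only be a key if every value is distinct.
    static boolean isActuallyUnique(long[] values) {
        Set<Long> seen = new HashSet<>();
        for (long v : values) {
            if (!seen.add(v)) {
                return false; // duplicate found
            }
        }
        return true;
    }

    // 1000 rows holding only the two values 1 and 1000 (assumed test data).
    static long[] sampleColumn() {
        long[] values = new long[1000];
        for (int i = 0; i < values.length; i++) {
            values[i] = (i % 2 == 0) ? 1L : 1000L;
        }
        return values;
    }

    public static void main(String[] args) {
        long[] column = sampleColumn();
        long min = Long.MAX_VALUE;
        long max = Long.MIN_VALUE;
        for (long v : column) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        System.out.println("heuristic says primary key: "
            + rangeSuggestsPrimaryKey(column.length, min, max));
        System.out.println("column is actually unique:  "
            + isActuallyUnique(column));
    }
}
```

With this data the range comparison holds even though the column has only two distinct values, so the heuristic and the ground truth disagree, which is exactly the situation described above.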
