
Do the statistics given to Calcite need to be exact, or can they be approximate?


In Hive, column statistics such as countDistinct and isPrimaryKey need to
be set. While doing so, Hive uses the following function, which sets the
primary key flag to true based on an assumption.


    public static void inferAndSetPrimaryKey(long numRows, List<ColStatistics> colStats) {
      if (colStats != null) {
        for (ColStatistics cs : colStats) {
          if (cs != null && cs.getCountDistint() >= numRows) {
            cs.setPrimaryKey(true);
          } else if (cs != null && cs.getRange() != null
              && cs.getRange().minValue != null
              && cs.getRange().maxValue != null) {
            if (numRows == ((cs.getRange().maxValue.longValue()
                - cs.getRange().minValue.longValue()) + 1)) {
              cs.setPrimaryKey(true);
            }
          }
        }
      }
    }

Suppose a column contains only two distinct values, 1 and 1000, and numRows
is 1000. The range check then gives (1000 - 1) + 1 = 1000, which equals
numRows, so the column is marked as a primary key even though that is wrong.
During planning, if the next node is an aggregation on that column, the
planner may decide the aggregation is unnecessary, because a primary key
column is assumed to hold only unique values.
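
The 1-and-1000 case can be reproduced with a small sketch. This is only an
illustration under my own assumptions: RangeHeuristicDemo,
looksLikePrimaryKey and exactDistinct are hypothetical names, and the code
mirrors just the range branch of the function above without using Hive's
ColStatistics class.

```java
import java.util.Arrays;

// Hypothetical stand-in for the range branch of inferAndSetPrimaryKey.
// It does not depend on Hive; the names here are made up for illustration.
final class RangeHeuristicDemo {

    // Mirrors the range check: true when (max - min + 1) == numRows.
    static boolean looksLikePrimaryKey(long numRows, long minValue, long maxValue) {
        return numRows == (maxValue - minValue) + 1;
    }

    // Exact number of distinct values, computed from the data for comparison.
    static long exactDistinct(long[] values) {
        return Arrays.stream(values).distinct().count();
    }

    public static void main(String[] args) {
        // 1000 rows that hold only the two values 1 and 1000.
        long[] column = new long[1000];
        for (int i = 0; i < column.length; i++) {
            column[i] = (i % 2 == 0) ? 1L : 1000L;
        }
        long min = Arrays.stream(column).min().getAsLong();
        long max = Arrays.stream(column).max().getAsLong();

        System.out.println("heuristic says primary key: "
            + looksLikePrimaryKey(column.length, min, max));
        System.out.println("exact distinct values: " + exactDistinct(column));
    }
}
```

The range check accepts the column while the exact distinct count is far
below numRows, which is the mismatch I am asking about.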

If Hive sets the primary key with the heuristic above and Calcite plans on
the basis of that flag, then the query result will also be wrong. How can
this be solved?

Similarly, for count distinct, is it okay to give approximate values to
Calcite?