
Re: Are all the statistics given to calcite, need to be exact or approximate?

It depends on the statistic. Most of them are approximate.

It’s the "garbage in, garbage out" principle. An exact statistic may be of a bit more (or a lot more) use to the consumer of the statistic, but is more effort for the producer of the statistic.

RelMdMaxRowCount is one of the few exact ones. If RelMdMaxRowCount says 10, the relation might return 0 rows or 9 rows or 10 rows but never 11 rows.
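To make that contract concrete, here is a minimal sketch in plain Java. It does not use Calcite's API; the class and the helper honorsMaxRowCount are hypothetical names, shown only to illustrate that a max row count is an upper bound and not an estimate.

```java
import java.util.List;

public class MaxRowCountContract {
    // Hypothetical helper (not part of Calcite): returns true when the number
    // of rows actually produced stays within the declared upper bound.
    static boolean honorsMaxRowCount(double declaredMax, long actualRows) {
        return actualRows <= declaredMax;
    }

    public static void main(String[] args) {
        double declaredMax = 10;
        // 0, 9 and 10 rows are all allowed; 11 rows would break the contract.
        for (long rows : List.of(0L, 9L, 10L, 11L)) {
            System.out.println(rows + " rows -> " + honorsMaxRowCount(declaredMax, rows));
        }
    }
}
```

An approximate statistic such as a row count estimate carries no such guarantee, so a consumer should not base a correctness-critical decision on it.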

RelMdPredicates is also exact (albeit not numeric). RelMdUniqueKeys is exact (which is to say, if it returns a key, that key is definitely unique; there may be some unique keys that it does not know about).

I don’t know what ColStatistics is. Is it a Hive thing? I surmise that it is based on RelMdRowCount, which is approximate.
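
The scenario in the quoted message can be checked with a few lines of plain Java. This is only a sketch: it does not use Hive's ColStatistics, and the class and method names are hypothetical. It applies the range-based rule from the quoted code to a 1000-row column that holds 1 once and 1000 everywhere else, then compares the verdict with the actual uniqueness of the data.

```java
import java.util.Arrays;

public class PrimaryKeyHeuristicDemo {
    // Range-based rule from the quoted code: treat the column as a key when
    // (max - min + 1) equals the row count. Hypothetical stand-alone version.
    static boolean rangeSuggestsKey(long numRows, long min, long max) {
        return numRows == (max - min) + 1;
    }

    // Ground truth: a column is a key only if every value is distinct.
    static boolean isActuallyUnique(long[] values) {
        return Arrays.stream(values).distinct().count() == values.length;
    }

    public static void main(String[] args) {
        long[] column = new long[1000];
        Arrays.fill(column, 1000L);
        column[0] = 1L; // one row holds 1, the other 999 rows hold 1000

        long min = Arrays.stream(column).min().getAsLong();
        long max = Arrays.stream(column).max().getAsLong();

        System.out.println("heuristic says key: " + rangeSuggestsKey(column.length, min, max));
        System.out.println("actually unique:    " + isActuallyUnique(column));
    }
}
```

The heuristic reports a key while the data has only two distinct values, so a result derived this way is an approximation and should not be passed on as an exact statistic such as a unique key.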


> On May 3, 2018, at 5:41 AM, Valli Annamalai <aishwaryaanns@xxxxxxxxx> wrote:
> In Hive, column statistics like countDistinct, isPrimaryKey, etc. need
> to be set. While doing so, the following Hive function sets primary key
> to true based on an assumption.
>    public static void inferAndSetPrimaryKey(long numRows, List<ColStatistics> colStats) {
>      if (colStats != null) {
>        for (ColStatistics cs : colStats) {
>          if (cs != null && cs.getCountDistint() >= numRows) {
>            cs.setPrimaryKey(true);
>          }
>          else if (cs != null && cs.getRange() != null &&
>              cs.getRange().minValue != null &&
>              cs.getRange().maxValue != null) {
>            if (numRows ==
>                ((cs.getRange().maxValue.longValue() - cs.getRange().minValue.longValue()) + 1)) {
>              cs.setPrimaryKey(true);
>            }
>          }
>        }
>      }
>    }
> If this is the case, suppose the column holds only two distinct values,
> 1 and 1000, and numRows is 1000. Then setting primary key to true would be
> wrong. During planning, if an aggregation is the next node, that node
> could be skipped, on the assumption that a primary key column has only
> unique values.
> If we set the primary key using the function above and Calcite also
> proceeds on that assumption, then the result will also be wrong. So how
> can this be solved?
> Similarly for count distinct, is it okay to give approximate values to
> Calcite?