Re: Are all the statistics given to calcite, need to be exact or approximate?


Not sure what you mean by “depict”.

If you want a description of the statistics, read their documentation. Hopefully it’s clear which statistics are exact and which are not.

Obviously we’d all like statistics to be exact. But for many statistics, it’s impossible to get an exact answer unless you actually execute the query. The row count is one example. So, exact versions of such statistics are useless for purposes of query optimization: by the time you could compute them, the query has already run.
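
To make the concern in the quoted message below concrete, here is a minimal sketch of the range branch of that heuristic applied to the example given there (1000 rows containing only the values 1 and 1000). The class and method names are hypothetical stand-ins for illustration; they are not Hive or Calcite API, and the exact uniqueness check assumes you can afford a full scan of the column.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration only; not part of Hive or Calcite.
final class PrimaryKeyInference {

  /** Range branch of the quoted heuristic: max - min + 1 == numRows. */
  static boolean rangeHeuristicSaysUnique(long numRows, long min, long max) {
    return numRows == (max - min) + 1;
  }

  /** Exact check: true only if no value repeats. Requires scanning every row. */
  static boolean isActuallyUnique(long[] values) {
    Set<Long> seen = new HashSet<>();
    for (long v : values) {
      if (!seen.add(v)) {
        return false; // duplicate found
      }
    }
    return true;
  }

  /** Builds the column from the example: 1000 rows holding only 1 and 1000. */
  static long[] twoValueColumn() {
    long[] values = new long[1000];
    for (int i = 0; i < values.length; i++) {
      values[i] = (i % 2 == 0) ? 1L : 1000L;
    }
    return values;
  }

  public static void main(String[] args) {
    long[] values = twoValueColumn();
    System.out.println("heuristic says unique: "
        + rangeHeuristicSaysUnique(values.length, 1L, 1000L));
    System.out.println("actually unique: " + isActuallyUnique(values));
  }
}
```

The heuristic answers "unique" while the scan answers "not unique", so a flag produced this way is an estimate. It may be fine as an input to costing, but it is probably not safe to treat as a guarantee when deciding whether a plan node can be removed.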



> On May 7, 2018, at 2:23 AM, aishwaryaanns@xxxxxxxxx wrote:
> 
> Can you depict which column statistics should be exact and which can be approximate to get a decent plan?
> 
> On 2018/05/04 05:31:30, aishwaryaanns@xxxxxxxxx <aishwaryaanns@xxxxxxxxx> wrote: 
>> Yes, ColStatistics is in Hive and it holds all the statistics about the columns.
>> 
>> On 2018/05/03 16:26:02, Julian Hyde <jhyde@xxxxxxxxxx> wrote: 
>>> It depends on the statistic. Most of them are approximate.
>>> 
>>> It’s the "garbage in, garbage out" principle. An exact statistic may be of a bit more (or a lot more) use to the consumer of the statistic, but is more effort for the producer of the statistic.
>>> 
>>> RelMdMaxRowCount is one of the few exact ones. If RelMdMaxRowCount says 10, the relation might return 0 rows or 9 rows or 10 rows but never 11 rows.
>>> 
>>> RelMdPredicates is also exact (albeit not numeric). RelMdUniqueKeys is exact (which is to say, if it returns a key, that key is definitely unique; there may be some unique keys that it does not know about).
>>> 
>>> I don’t know what ColStatistics is. Is it a Hive thing? I surmise that it is based on RelMdRowCount, which is approximate.
>>> 
>>> Julian
>>> 
>>> 
>>>> On May 3, 2018, at 5:41 AM, Valli Annamalai <aishwaryaanns@xxxxxxxxx> wrote:
>>>> 
>>>> In Hive, column statistics like countDistinct, isPrimaryKey, etc. need to
>>>> be set. While doing so, the following Hive function sets primary key to
>>>> true based on an assumption.
>>>> 
>>>> 
>>>>   public static void inferAndSetPrimaryKey(long numRows, List<ColStatistics> colStats) {
>>>>     if (colStats != null) {
>>>>       for (ColStatistics cs : colStats) {
>>>>         if (cs != null && cs.getCountDistint() >= numRows) {
>>>>           cs.setPrimaryKey(true);
>>>>         } else if (cs != null && cs.getRange() != null
>>>>             && cs.getRange().minValue != null
>>>>             && cs.getRange().maxValue != null) {
>>>>           if (numRows == ((cs.getRange().maxValue.longValue()
>>>>               - cs.getRange().minValue.longValue()) + 1)) {
>>>>             cs.setPrimaryKey(true);
>>>>           }
>>>>         }
>>>>       }
>>>>     }
>>>>   }
>>>> 
>>>> If this is the case, suppose the column holds only two distinct values,
>>>> 1 and 1000, and numRows is 1000. Then max - min + 1 equals numRows, so the
>>>> function sets primary key to true, which is wrong. While planning, if an
>>>> aggregation is the next node, the planner could decide that the node need
>>>> not be executed, on the assumption that a primary key column holds only
>>>> unique values.
>>>> 
>>>> If the primary key is set based on the assumption in the function above,
>>>> and Calcite also proceeds with that assumption, then the result will be
>>>> wrong. So how could this be solved?
>>>> 
>>>> Similarly, for count distinct, is it okay to give approximate values to
>>>> Calcite?
>>> 
>>> 
>>