Re: Druid + Theta Sketches performance



On 2018/10/19 14:42:18, Charles Allen <charles.allen@xxxxxxxx.INVALID> wrote: 
> This is a good callout. Those numbers still seem very slow. One thing I'm
> curious about is whether you are dropping the id when you index, or whether
> the id is also being indexed into the Druid segments.
> 
> With how Druid does indexing, it dictionary-encodes all the dimension
> values. So the cardinality of rows is a function of the queryGranularity and
> the cardinality of dimension value tuples per query granularity "bucket".
> This allows dynamic slice and dice on the data. But if it is accidentally
> including a dimension with very high cardinality (like an ID) in the
> dictionary encoding, then it is not able to make efficient use of roll-up.
> 
> In order to facilitate dynamic slice and dice, the theta sketches need to
> have *some* kind of object stored per dimension tuple per query granularity
> (but only if the tuple appears in that bucket). So you can reduce the
> number of things that get read off of disk by trying to increase the
> rollup. Usually this is done by dropping or reducing high cardinality
> dimensions, but can also be done by changing the query granularity.
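> As a purely illustrative sketch (not your exact spec), keeping the raw id
> out of the dimension list and coarsening the query granularity would look
> roughly like this:
> 
>     "dimensionsSpec" : {
>       "dimensions" : ["key", "country"],
>       "dimensionExclusions" : ["id"]
>     },
>     "granularitySpec" : {
>       "type" : "uniform",
>       "segmentGranularity" : "day",
>       "queryGranularity" : "hour"
>     }
> 
> Every distinct (key, country) pair then rolls up to at most one row (and
> one sketch object) per hour bucket.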
> 
> Another trick is to use topN or Timeseries. In general, those query types
> can be optimized more aggressively since they target a very limited scope
> of use cases.
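> For example, a timeseries query over the sketch metric (just a sketch of
> the query shape, with placeholder names) could look like:
> 
>     {
>       "queryType" : "timeseries",
>       "dataSource" : "your_datasource",
>       "granularity" : "day",
>       "filter" : { "type" : "selector", "dimension" : "key", "value" : "<some key>" },
>       "aggregations" : [
>         { "type" : "thetaSketch", "name" : "unique_ids", "fieldName" : "your_sketch_metric" }
>       ],
>       "intervals" : ["2018-07-14/2018-07-21"]
>     }
> 
> Here "your_datasource", "<some key>" and "your_sketch_metric" are
> placeholders; the point is that there is no grouping dimension, only the
> time bucketing.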
> 
> Now, to Theta Sketches itself, I am not as familiar with the Theta Sketches
> code paths. It is possible there are performance gains to be had.
> 
> Hope this helps,
> Charles Allen
> 
> 
> On Fri, Oct 19, 2018 at 3:38 AM alex.rnv.ru@xxxxxxxxx <alex.rnv.ru@xxxxxxxxx>
> wrote:
> 
> > Hi Druid devs,
> > I am testing Druid for our specific count distinct estimation case. Data
> > was ingested via Hadoop indexer.
> > When simplified, it has following schema:
> > timestamp    key    country    theta-sketch<id>    event-counter
> > So, there are 2 dimensions, one counter metric, one theta sketch metric.
> > Data granularity is a DAY.
> > Data source in deep storage is 150-200GB per day.
> >
> > I was doing some test runs with our small test cluster (4 Historical
> > nodes, 8 CPU, 64GB RAM, 500GB SSD). I admit that with this RAM-to-SSD ratio
> > and number of nodes it is not going to be fast. The question, though, is
> > about theta-sketch performance compared to counter aggregation. The
> > difference is an order of magnitude. E.g. a GroupBy query for a single key,
> > aggregated over 7 days:
> > event-counters - 30 seconds.
> > theta-sketches - 7 minutes.
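> > For reference, the two timings come from queries of roughly this shape
> > (simplified, using the column names from the schema above; the 30-second
> > run uses only the longSum aggregation, the 7-minute run only the
> > thetaSketch one):
> >
> >     {
> >       "queryType" : "groupBy",
> >       "dataSource" : "theta-poc",
> >       "granularity" : "all",
> >       "dimensions" : ["country"],
> >       "filter" : { "type" : "selector", "dimension" : "key", "value" : "<some key>" },
> >       "aggregations" : [
> >         { "type" : "longSum", "name" : "events", "fieldName" : "event-counter" },
> >         { "type" : "thetaSketch", "name" : "unique_ids", "fieldName" : "theta-sketch" }
> >       ],
> >       "intervals" : ["2018-07-14/2018-07-21"]
> >     }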
> >
> > Of course, Theta Sketch aggregation implies more work than summing up
> > numbers, but the Theta Sketch documentation says that the union operation
> > is very fast.
> >
> > I did some profiling of one of the Historical nodes. Most of the CPU time
> > is spent in
> > io.druid.query.aggregation.datasketches.theta.SketchObjectStrategy.fromByteBuffer(ByteBuffer,
> > int), which I think is moving Sketch objects from off-heap memory onto the
> > managed heap. To be precise, the time is spent in the sketch library methods
> > com.yahoo.memory.WritableMemoryImpl.region and
> > com.yahoo.memory.Memory.wrap.
> >
> > I do not think anything is wrong with that code itself; the question is why
> > it is called so many times.
> > Which leads to my main question: I do not really understand how a theta
> > sketch is stored in a columnar database. Assuming it is stored the same way
> > as a counter, it means that for every combination of "key" and "country"
> > (the dimensions from above) there is a theta sketch structure to be stored.
> > In our case the "key" cardinality is quite high, hence so many Sketch
> > structure accesses in Java. That looks extremely inefficient. Again, it is
> > just an assumption; please excuse me if I am wrong here.
> >
> > If you continue thinking in this direction, in terms of performance it
> > makes sense to store one Theta sketch for every dimension value, so instead
> > of having cardinality(key) * cardinality(countries) entries there will be
> > cardinality(key) + cardinality(countries) sketches. In this case it looks
> > like an index, not a part of columnar storage itself.
> > Queries for 2 dimensions are easy, as there is only one INTERSECTION to be
> > done. It all looks like a natural thing to do for sketches, as there will
> > be a win in terms of storage and query performance.
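> > (To make "one INTERSECTION" concrete: the set operation itself already
> > exists at query time as a post-aggregation in the datasketches extension;
> > with hypothetical per-dimension sketches it would look something like this,
> > where "sketch_for_key" and "sketch_for_country" are placeholder names:)
> >
> >     {
> >       "type" : "thetaSketchEstimate",
> >       "name" : "unique_ids_for_key_and_country",
> >       "field" : {
> >         "type" : "thetaSketchSetOp",
> >         "name" : "key_and_country_sketch",
> >         "func" : "INTERSECT",
> >         "fields" : [
> >           { "type" : "fieldAccess", "fieldName" : "sketch_for_key" },
> >           { "type" : "fieldAccess", "fieldName" : "sketch_for_country" }
> >         ]
> >       }
> >     }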
> > My question is whether I am right or wrong in my assumptions. If my
> > understanding is not correct and sketches are already stored in an optimal
> > way, could someone give advice on speeding up computations on a single
> > Historical node? Otherwise, I wanted to ask whether there is an attempt or
> > discussion to use sketches in the way I described.
> > Thanks in advance.
> >
Charles,
thank you for the quick reply.
The dataSchema part of my Hadoop indexing spec looks like this (we copy the data from TSV):
    "dataSchema" : {
      "dataSource" : "theta-poc",
      "granularitySpec" : {
        "type" : "uniform",
        "segmentGranularity" : "day",
        "queryGranularity" : "none",
        "intervals" : ["2018-07-14/2018-07-15"]
      },
      "parser" : {
        "type" : "hadoopyString",
        "parseSpec" : {
          "format" : "tsv",
          "dimensionsSpec" : {
            "dimensions" : [
              "country",
              "region",
              "supplierID",
              "segmentID"
            ]
          },
          "timestampSpec" : {
            "format" : "auto",
            "column" : "time"
          },
          "columns" : [
              "time",
              "ipAddress",
              "country",
              "region",
              "URL",
              "referrer",
              "userAgent",
              "userID",
              "supplierID",
              "segmentID",
              "requestID"
          ],
          "listDelimiter" : ","
        }
      },
      "metricsSpec" : [
        {
          "name" : "count",
          "type" : "count"
        },
        {
          "type" : "thetaSketch",
          "name" : "user_unique_theta",
          "fieldName" : "userID",
          "isInputThetaSketch": false,
          "size": 16384
        }
      ]
    }
So most of the columns are skipped; the high-cardinality column is userID, which we only store as a sketch. The dimension columns have the following cardinalities:
country - 250
region - 5K (the most frequent value is empty, though)
supplierID - 1K
segmentID - 100K
There are dependency relations between region and country, and between supplierID and segmentID, but the total number of dimension value combinations is still quite high.
Please let me know if there is anything suspicious here.

I tried a Timeseries query instead of GroupBy, and it is indeed faster. It looks like it makes sense for us to have a separate monthly dataset instead of aggregating data from the daily-granularity dataset.
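For instance, such a monthly roll-up could be ingested with a granularitySpec along these lines (just a sketch, we have not set this up yet):

      "granularitySpec" : {
        "type" : "uniform",
        "segmentGranularity" : "month",
        "queryGranularity" : "month",
        "intervals" : ["2018-07-01/2018-08-01"]
      }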



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@xxxxxxxxxxxxxxxx
For additional commands, e-mail: dev-help@xxxxxxxxxxxxxxxx