Re: Druid + Theta Sketches performance


Honestly I do not remember how the dimension exclusion vs dimension
inclusion stuff works. I have to look it up every time. If you look at any
segment for that datasource in the Coordinator Console, it should give you
a list of dimensions and metrics. Do they match what you expect?
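
For a scriptable version of the same check, Druid's segmentMetadata query reports the columns stored in each segment. A minimal sketch, assuming the datasource name and interval from the indexing spec quoted further down; build_segment_metadata_query is a hypothetical helper (not part of Druid), and the broker host and port are placeholders:

```python
import json

def build_segment_metadata_query(datasource, interval):
    """Build the JSON body of a Druid segmentMetadata query."""
    return {
        "queryType": "segmentMetadata",
        "dataSource": datasource,
        "intervals": [interval],
        # Merge the per-segment results into a single column listing.
        "merge": True,
    }

query = build_segment_metadata_query("theta-poc", "2018-07-14/2018-07-15")
print(json.dumps(query, indent=2))

# To run it, save the printed JSON as query.json and POST it to the broker:
#   curl -X POST -H 'Content-Type: application/json' \
#        -d @query.json http://BROKER_HOST:8082/druid/v2/
# BROKER_HOST and the port are placeholders for your own cluster.
```

The "columns" map in the response should list exactly the dimensions and metrics you expect; an unexpected high-cardinality column there would point to a dimension inclusion problem.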

On Sun, Oct 21, 2018 at 9:17 PM alex.rnv.ru@xxxxxxxxx <alex.rnv.ru@xxxxxxxxx>
wrote:

>
>
> On 2018/10/19 14:42:18, Charles Allen <charles.allen@xxxxxxxx.INVALID>
> wrote:
> > This is a good callout. Those numbers still seem very slow. One thing I'm
> > curious about is whether you are dropping the id when you index, or
> > whether the id is also being indexed into the druid segments.
> >
> > When druid indexes, it dictionary encodes all the dimension values. So the
> > number of stored rows is a function of the QueryGranularity and of the
> > cardinality of dimension value tuples per query granularity "bucket". This
> > allows dynamic slice and dice on the data. But if a dimension with very
> > high cardinality (like ID) is accidentally included in the dictionary
> > encoding, then druid is not able to make efficient use of roll-up.
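
The effect of a high-cardinality dimension on roll-up can be illustrated with a toy model. This is only a sketch of the idea, not Druid's indexing code; the events and the rolled_up_rows helper are made up for illustration:

```python
from collections import Counter
from datetime import datetime

# Hypothetical raw events: (timestamp, country, user id).
events = [
    ("2018-07-14T01:00:00", "US", "u1"),
    ("2018-07-14T02:00:00", "US", "u2"),
    ("2018-07-14T03:00:00", "US", "u3"),
    ("2018-07-14T04:00:00", "DE", "u1"),
    ("2018-07-14T05:00:00", "DE", "u4"),
]

def rolled_up_rows(events, dimensions):
    """Count stored rows after roll-up to DAY query granularity.

    Events that share the same truncated timestamp and the same values
    for every listed dimension collapse into a single stored row.
    """
    index = {"country": 1, "user_id": 2}
    groups = Counter()
    for event in events:
        day = datetime.fromisoformat(event[0]).date()
        key = (day,) + tuple(event[index[d]] for d in dimensions)
        groups[key] += 1
    return len(groups)

rows_without_id = rolled_up_rows(events, ["country"])
rows_with_id = rolled_up_rows(events, ["country", "user_id"])
print(rows_without_id)  # 2: one row per (day, country)
print(rows_with_id)     # 5: every event keeps its own row
```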
> >
> > In order to facilitate dynamic slice and dice, the theta sketches need to
> > have *some* kind of object stored per dimension tuple per query
> > granularity (but only if the tuple appears in that bucket). So you can
> > reduce the number of things that get read off of disk by trying to
> > increase the rollup. Usually this is done by dropping or reducing high
> > cardinality dimensions, but can also be done by changing the query
> > granularity.
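
To make the granularity point concrete, here is a small sketch that counts how many per-tuple objects would be stored at different query granularities. It uses a plain Python set as a stand-in for a theta sketch, and the events are hypothetical; "none" is modeled as keeping the timestamp untruncated:

```python
from datetime import datetime

# Hypothetical events: (timestamp, country, user id).
events = [
    ("2018-07-14T01:05:00", "US", "u1"),
    ("2018-07-14T01:40:00", "US", "u2"),
    ("2018-07-14T09:15:00", "US", "u1"),
    ("2018-07-14T09:20:00", "DE", "u3"),
    ("2018-07-15T11:00:00", "DE", "u3"),
]

def truncate(ts, granularity):
    """Truncate a timestamp to the start of its granularity bucket."""
    t = datetime.fromisoformat(ts)
    if granularity == "none":   # keep the timestamp as-is
        return t
    if granularity == "hour":
        return t.replace(minute=0, second=0, microsecond=0)
    if granularity == "day":
        return t.replace(hour=0, minute=0, second=0, microsecond=0)
    raise ValueError(granularity)

def sketches_per_granularity(events, granularity):
    """Map (bucket, country) -> set of user ids (stand-in for a sketch)."""
    stored = {}
    for ts, country, user in events:
        bucket = truncate(ts, granularity)
        stored.setdefault((bucket, country), set()).add(user)
    return stored

for g in ("none", "hour", "day"):
    print(g, len(sketches_per_granularity(events, g)))
# prints: none 5, hour 4, day 3 (one per line)
```

Coarser buckets mean fewer stored objects to read and merge at query time, at the cost of losing finer time resolution.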
> >
> > Another trick is to use topN or Timeseries. In general, those query types
> > can be optimized better because they serve a much narrower use case.
> >
> > Now, as for Theta Sketches itself, I am not as familiar with the Theta
> > Sketches code paths. It is possible there are performance gains to be had.
> >
> > Hope this helps,
> > Charles Allen
> >
> >
> > On Fri, Oct 19, 2018 at 3:38 AM alex.rnv.ru@xxxxxxxxx
> > <alex.rnv.ru@xxxxxxxxx> wrote:
> >
> > > Hi Druid devs,
> > > I am testing Druid for our specific count distinct estimation case. Data
> > > was ingested via the Hadoop indexer.
> > > Simplified, it has the following schema:
> > > timestamp    key    country    theta-sketch<id>    event-counter
> > > So there are 2 dimensions, one counter metric, and one theta sketch
> > > metric.
> > > Data granularity is DAY.
> > > The data source in deep storage is 150-200GB per day.
> > >
> > > I did some test runs on our small test cluster (4 Historical
> > > nodes, 8 CPU, 64GB RAM, 500SSD RAM). I admit that with this RAM-SSD ratio
> > > and number of nodes it is not going to be fast. My question, though, is
> > > about theta sketch performance compared to counter aggregation. The
> > > difference is an order of magnitude. E.g., a GroupBy query for a single
> > > key, aggregated over 7 days:
> > > event-counters - 30 seconds.
> > > theta-sketches -  7 minutes.
> > >
> > > Theta Sketch aggregation implies more work than summing up numbers, of
> > > course. But the Theta Sketch documentation says that the union operation
> > > is very fast.
> > >
> > > I did some profiling of one of the Historical nodes. Most of the CPU time
> > > is spent in
> > > io.druid.query.aggregation.datasketches.theta.SketchObjectStrategy.fromByteBuffer(ByteBuffer, int),
> > > which I think is moving Sketch objects from off-heap to the managed heap.
> > > To be precise, the time is spent in these sketch library methods:
> > > com.yahoo.memory.WritableMemoryImpl.region
> > > com.yahoo.memory.Memory.wrap
> > >
> > > I do not think anything is wrong with this code; the question is why it
> > > is called so many times.
> > > This leads to the main question. I do not really understand how a
> > > theta sketch is stored in a columnar database. Assuming it is stored the
> > > same way as a counter, there is a theta sketch structure stored for every
> > > combination of "key" and "country" (the dimensions from above). In our
> > > case the cardinality of "key" is quite high, hence so many Sketch
> > > structure accesses in Java. This looks extremely inefficient. Again, it
> > > is just an assumption. Please excuse me if I am wrong here.
> > >
> > > Continuing in this direction, in terms of performance it makes sense to
> > > store one Theta sketch for every dimension value, so that instead of
> > > cardinality(key) * cardinality(countries) entries there will be
> > > cardinality(key) + cardinality(countries) sketches. In this case it looks
> > > like an index, not a part of the columnar storage itself.
> > > Queries on 2 dimensions are easy, as there is only one INTERSECTION to be
> > > done. It all looks like a natural thing to do for sketches, as there will
> > > be a win in terms of both storage and query performance.
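
The per-dimension-value idea can be tried out with a toy theta sketch. This is a from-scratch illustration under stated assumptions, not the DataSketches library and not how Druid stores sketches; the ThetaSketch class, union, intersection, and the rows are all hypothetical. Note that intersecting per-value sketches counts ids seen with the key anywhere and with the country anywhere; that matches the per-tuple count only when, as assumed here, each id has a single country.

```python
import hashlib

class ThetaSketch:
    """Toy theta sketch: retains hash values below a threshold theta."""

    def __init__(self, k=4096):
        self.k = k            # nominal number of retained hashes
        self.theta = 1.0      # sampling threshold in (0, 1]
        self.hashes = set()   # retained hash values, all < theta

    @staticmethod
    def _hash(item):
        digest = hashlib.sha256(str(item).encode()).digest()
        return int.from_bytes(digest[:8], "big") / 2**64  # in [0, 1)

    def _trim(self):
        # Keep the k smallest hashes; the (k+1)-th smallest becomes theta.
        if len(self.hashes) > self.k:
            ordered = sorted(self.hashes)
            self.theta = ordered[self.k]
            self.hashes = set(ordered[:self.k])

    def update(self, item):
        h = self._hash(item)
        if h < self.theta:
            self.hashes.add(h)
            self._trim()

    def estimate(self):
        return len(self.hashes) / self.theta

def union(a, b):
    out = ThetaSketch(min(a.k, b.k))
    out.theta = min(a.theta, b.theta)
    out.hashes = {h for h in a.hashes | b.hashes if h < out.theta}
    out._trim()
    return out

def intersection(a, b):
    out = ThetaSketch(min(a.k, b.k))
    out.theta = min(a.theta, b.theta)
    out.hashes = {h for h in a.hashes & b.hashes if h < out.theta}
    return out

# Hypothetical rows: (key, country, user id); each user has one country.
rows = [(f"k{u % 3}", "US" if u % 2 == 0 else "DE", f"user{u}")
        for u in range(600)]

by_key, by_country = {}, {}
for key, country, user in rows:
    by_key.setdefault(key, ThetaSketch()).update(user)
    by_country.setdefault(country, ThetaSketch()).update(user)

estimate = intersection(by_key["k0"], by_country["US"]).estimate()
exact = len({u for k, c, u in rows if k == "k0" and c == "US"})
print(len(by_key) + len(by_country), "sketches stored")  # 3 + 2 = 5
print(estimate, exact)
```

With only 600 ids and k = 4096 the sketches never start sampling, so the estimate here equals the exact count. Once sketches are in estimation mode, the relative error of an intersection grows as the intersection gets small compared to its inputs, which is a cost to weigh against the storage savings.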
> > > My question is whether I am right or wrong in my assumptions. If my
> > > understanding is not correct and sketches are already stored in an
> > > optimal way, could someone give advice on speeding up computations on a
> > > single Historical node? Otherwise, I wanted to ask whether there has been
> > > an attempt or discussion to use sketches in the way I described.
> > > Thanks in advance.
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: dev-unsubscribe@xxxxxxxxxxxxxxxx
> > > For additional commands, e-mail: dev-help@xxxxxxxxxxxxxxxx
> > >
> > >
> Charles,
> thank you for the quick reply.
> The data schema part of my hadoop indexing spec looks like this (we copy
> data from TSV):
>     "dataSchema" : {
>       "dataSource" : "theta-poc",
>       "granularitySpec" : {
>         "type" : "uniform",
>         "segmentGranularity" : "day",
>         "queryGranularity" : "none",
>         "intervals" : ["2018-07-14/2018-07-15"]
>       },
>       "parser" : {
>         "type" : "hadoopyString",
>         "parseSpec" : {
>           "format" : "tsv",
>           "dimensionsSpec" : {
>             "dimensions" : [
>               "country",
>               "region",
>               "supplierID",
>               "segmentID"
>             ]
>           },
>           "timestampSpec" : {
>             "format" : "auto",
>             "column" : "time"
>           },
>           "columns" : [
>               "time",
>               "ipAddress",
>               "country",
>               "region",
>               "URL",
>               "referrer",
>               "userAgent",
>               "userID",
>               "supplierID",
>               "segmentID",
>               "requestID"
>           ],
>           "listDelimiter" : ","
>         }
>       },
>       "metricsSpec" : [
>         {
>           "name" : "count",
>           "type" : "count"
>         },
>         {
>           "type" : "thetaSketch",
>           "name" : "user_unique_theta",
>           "fieldName" : "userID",
>           "isInputThetaSketch": false,
>           "size": 16384
>         }
>       ]
>     }
> So most of the columns are skipped. The high cardinality column is userID,
> which we only store as a sketch. The dimension columns have these
> cardinalities:
> country - 250
> region -  5K (the most frequent value is empty though)
> supplierID - 1K
> segmentID - 100K
> There are dependency relations between region and country, and between
> supplierID and segmentID, but the total number of permutations is still
> quite high.
> Please let me know if there is something suspicious here.
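
As a rough worked example, the cardinalities quoted above give the following upper bounds on the number of dimension tuples per time bucket. The dependency model is an assumption (each region belongs to exactly one country and each segmentID to exactly one supplierID); real data contains only the tuples that actually occur, so the true counts can be far lower:

```python
from math import prod

# Cardinalities quoted above.
cardinality = {
    "country": 250,
    "region": 5_000,
    "supplierID": 1_000,
    "segmentID": 100_000,
}

# Upper bound if every combination of values occurred.
all_combinations = prod(cardinality.values())

# Assumption: region determines country and segmentID determines
# supplierID, so only region and segmentID vary independently.
with_dependencies = cardinality["region"] * cardinality["segmentID"]

# One sketch per dimension value, as proposed earlier in the thread.
per_value = sum(cardinality.values())

print(f"{all_combinations:,}")   # 125,000,000,000,000
print(f"{with_dependencies:,}")  # 500,000,000
print(f"{per_value:,}")          # 106,250
```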
>
> I tried a Timeseries query instead of GroupBy, and it is indeed faster. It
> looks like it makes sense for us to have another, monthly dataset instead
> of aggregating data from the daily granularity dataset.
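
For reference, a Timeseries query over the schema above might look like the following sketch. The metric names come from the quoted metricsSpec; the 7-day interval, the filter value "42", and the build_timeseries_query helper are hypothetical:

```python
import json

def build_timeseries_query(datasource, interval, dimension, value):
    """Build a Druid timeseries query that merges theta sketches."""
    return {
        "queryType": "timeseries",
        "dataSource": datasource,
        "granularity": "all",  # one result bucket for the whole interval
        "intervals": [interval],
        "filter": {
            "type": "selector",
            "dimension": dimension,
            "value": value,
        },
        "aggregations": [
            # Sum the ingestion-time row counts to get the event total.
            {"type": "longSum", "name": "events", "fieldName": "count"},
            # Union the stored sketches and report the distinct estimate.
            {
                "type": "thetaSketch",
                "name": "unique_users",
                "fieldName": "user_unique_theta",
                "size": 16384,
            },
        ],
    }

query = build_timeseries_query(
    "theta-poc", "2018-07-14/2018-07-21", "supplierID", "42"
)
print(json.dumps(query, indent=2))
```

The printed JSON can be POSTed to the broker's /druid/v2/ endpoint like any other native query.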
>