git.net

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: scrolling in ElasticSearch adapter


There is new Composite Aggregation
<https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-composite-aggregation.html>
(still in beta) which allows pagination with aggregates. It is available on
versions >= 5.6 (unfortunately we have to support 2.6+).

To sum up:

   1. For non-aggregate queries always use scrolling or search-after by
   default (not pagination) .
   2. For aggregates use pagination. Either Terms Aggregation (with
   partitioning) or Composite Aggregation.
   3. If possible Retrieve batch size from JDBC api (known as
   Statement.setFetchSize(int)). If not default to 10k ?

Does it sound reasonable ?


On Thu, Oct 25, 2018 at 3:43 PM Kevin Risden <krisden@xxxxxxxxxx> wrote:

> What I was saying is that scrolling is the only way to ensure you get
> correct results coming back from Elasticsearch if you are going to do more
> processing.
>
>
> https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-scroll.html
>
> As long as you delete the scroll at the end its not the end of the world.
> There is no deep paging unless you use scrolling or a search after which is
> basically the same idea as scrolling
>
> https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-search-after.html
> .
>
> I would say that scrolling should be the default behavior and then should
> look at optimizing for aggregations, smaller result sets, etc. If you look
> at scrolling as an optimization the default results won't be correct (ie:
> only returning 10 results when there are a lot more). For default
> pagination there is a limit of 10000 results anyway unless you use
> scrolling or search after.
>
> Kevin Risden
>
>
> On Thu, Oct 25, 2018 at 3:33 PM Andrei Sereda <andrei@xxxxxxxxx> wrote:
>
> > Hi Kevin,
> >
> > You suggest to use scrolling for all elastic queries ? Even when there
> > are predicates ?  Some questions :
> >
> > 1) Scrolling has a runtime overhead for elastic cluster. Having it
> > enabled by default (against vendor recommendation) is risky. Does it
> > (not) cause issues in Solr ?
> > 2) Scrolling does not work with aggregates, which means I'll have to
> > fallback to pagination (for whole result set).
> > 3) When result set is small scrolling / pagination is not necessary.
> >
> > Having said that I don't see any way to automatically enable / disable
> > scrolling based on RelTree (pls let me know if there is one).
> >
> > You seem to agree with Stamatis that `select * from elastic` (1)
> > should return all documents (rows) not just 10 (ES default). If so,
> > one either has to use pagination or scrolling.
> > What about queries `select count(*), foo from elastic group by foo`
> > (2) (term aggregation). In this case only pagination works. If calcite
> > returns all rows for (1) it should do the same for (2).  Should one
> > use pagination with aggregates (2) and scrolling for simple queries
> > (1) ?
> > What if user specifies a high `LIMIT` (eg 10M) should we batch the
> result ?
> > Should user explicitly enable scrolling (eg. JDBC api) ? If so, should
> > one throw an exception when scrolling and aggregation is used or
> > quietly default to pagination ?
> >
> > Thanks everybody for your input, it helps.
> >
> > Regards,
> > Andrei.
> >
> >
> > On Thu, Oct 25, 2018 at 2:08 PM
> > Kevin Risden
> > <krisden@xxxxxxxxxx> wrote:
> > >
> > > >
> > > > There is one more “issue”. Currently select * from elastic returns at
> > most
> > > > 10 rows (in calcite). This is consistent with elastic behaviour which
> > > > limits result set to 10 documents (unless size is specified).
> > >
> > >
> > > In Solr land for the Calcite integration it uses the /export handler or
> > > streaming expressions when there is no limit specified (these are just
> > like
> > > scrolling or called CursorMark in Solr). It falls back to /select with
> > the
> > > default limit of 10 rows.
> > >
> > > I think that scroll should be the default almost always since it is the
> > > only way to return all the results. Even with a limit = 100000 would
> > cause
> > > problems with the default query handler. Paging deep into the result
> set
> > is
> > > not something that Lucene (Elasticsearch or Solr) does well.
> > >
> > >
> > > Kevin Risden
> > >
> > >
> > > On Thu, Oct 25, 2018 at 1:31 PM Julian Hyde <jhyde@xxxxxxxxxx> wrote:
> > >
> > > > Do you need to generate a different plan (i.e. a different tree of
> > > > RelNodes) for scrolling vs non-scrolling? If so, it’s certainly
> > > > inconvenient that you don’t know until execute time whether they want
> > > > scrolling. A possible solution would be to generate TWO plans - one
> > > > scrolling, one non-scrolling inside the prepared statement - and pick
> > which
> > > > one based on the runtime context.
> > > >
> > > > Julian
> > > >
> > > > > On Oct 25, 2018, at 1:42 AM, Christian Beikov <
> > > > christian.beikov@xxxxxxxxx> wrote:
> > > > >
> > > > > Hey Andrei,
> > > > >
> > > > > I don't have an answer for how you can access these settings from
> > within
> > > > the adapter nor how one could do that via RelNodes but the suggestion
> > to
> > > > use DataContext for that purpose sounds reasonable. Maybe someone
> else
> > has
> > > > an idea?
> > > > >
> > > > > Anyway, since these are settings that don't affect the general
> > semantics
> > > > of the query/statement and also usually require a special API to be
> > used,
> > > > I'd rather see these aspects not end up in the query string.
> > > > >
> > > > > Am 25.10.2018 um 02:15 schrieb Andrei Sereda:
> > > > >> Christian,
> > > > >>
> > > > >> I like TYPE_SCROLL_INSENSITIVE / fetchSize in PreparedStatement
> > > > >> generally but have some reservations (questions) :
> > > > >>
> > > > >> How to pass resultSetType / fetchSize from PreparedStatement to
> > > > RelNodes ?
> > > > >> What if user doesn’t use JDBC (eg. RelBuilders) ?
> > > > >> On Wed, Oct 24, 2018 at 6:28 PM Christian Beikov
> > > > >> <christian.beikov@xxxxxxxxx> wrote:
> > > > >>> In JDBC one can configure a fetch size which would reflect the
> > amount
> > > > of
> > > > >>> rows to be fetched initially, but also subsequently.
> > > > >>>
> > > >
> >
> https://docs.oracle.com/javase/7/docs/api/java/sql/Statement.html#setFetchSize(int)
> > > > >>>
> > > > >>> According to what you are writing, ES behvior is what
> > > > >>> TYPE_SCROLL_INSENSITIVE would do i.e. provide a snapshot view
> that
> > > > isn't
> > > > >>> affected by changes.
> > > > >>>
> > > > >>> IMO TYPE_SCROLL_SENSITIVE means that if you have rows R1, R2, R3,
> > R4,
> > > > >>> ... and view R1, R2, then R3 is deleted and you fetch the next
> > rows,
> > > > you
> > > > >>> wouldn't see R3.
> > > > >>>
> > > > >>> According to the JDBC spec
> > > > >>> (
> > > >
> >
> https://docs.oracle.com/javase/7/docs/api/java/sql/Connection.html#prepareStatement(java.lang.String,%20int,%20int)
> > > > >>> ) you don't have to support all modes. Usually, user code doesn't
> > use
> > > > >>> scrolling that much, but at least forward scrolling makes sense.
> > > > >>>
> > > > >>> Am 24.10.2018 um 21:38 schrieb Andrei Sereda:
> > > > >>>> Hi Julian,
> > > > >>>>
> > > > >>>> Scrolling (in elastic) does not only mean “open a cursor” but
> also
> > > > iterate
> > > > >>>> over consistent snapshot. From docs:
> > > > >>>>
> > > > >>>> The results that are returned from a scroll request reflect the
> > state
> > > > of
> > > > >>>> the index at the time that the initial search request was made,
> > like a
> > > > >>>> snapshot in time. Subsequent changes to documents (index, update
> > or
> > > > delete)
> > > > >>>> will only affect later search requests.
> > > > >>>>
> > > > >>>> So pagination (fetch / offset) can’t exactly replicate this
> > > > functionality.
> > > > >>>>
> > > > >>>> The problem with scrolling (in elastic) is that it is expensive
> > and
> > > > can’t
> > > > >>>> (shouldn’t) be enabled it by default.
> > > > >>>>
> > > > >>>> There is one more “issue”. Currently select * from elastic
> > returns at
> > > > most
> > > > >>>> 10 rows (in calcite). This is consistent with elastic behaviour
> > which
> > > > >>>> limits result set to 10 documents (unless size is specified).
> When
> > > > >>>> returning a cursor (eg. using JDBC TYPE_SCROLL_SENSITIVE
> > > > >>>> <
> > > >
> >
> https://docs.oracle.com/javase/7/docs/api/java/sql/ResultSet.html#TYPE_SCROLL_SENSITIVE
> > > > >
> > > > >>>> or SQL hint) does it mean return whole elastic index ? I’m not
> at
> > > > ease with
> > > > >>>> returning different results based on hints or cursor settings.
> > > > >>>>
> > > > >>>> Andrei.
> > > > >>>>
> > > > >>>> On Wed, Oct 24, 2018 at 3:02 PM Julian Hyde <
> > jhyde.apache@xxxxxxxxx>
> > > > wrote:
> > > > >>>>
> > > > >>>>> It seems to me that Elasticsearch scroll means return a cursor
> -
> > a
> > > > >>>>> collection of rows that you iterate over, and you may not read
> > all
> > > > of them.
> > > > >>>>> This is the default operation of JDBC.
> > > > >>>>>
> > > > >>>>> So, I guess we need to give the user a way to signal their
> > intent to
> > > > read
> > > > >>>>> all rows versus only the first few. Oracle’s FIRST_ROWS and
> > ALL_ROWS
> > > > >>>>> hints[1] seem close to this. We would want the hints to be
> acted
> > > > upon by
> > > > >>>>> both the optimizer and the JDBC transport.
> > > > >>>>>
> > > > >>>>> Related is pagination. SQL has FETCH and OFFSET, which allow
> you
> > to
> > > > >>>>> retrieve different pieces of a large result set in separate
> > > > statements or
> > > > >>>>> (using query parameters) executions. It would be useful if the
> > > > server could
> > > > >>>>> be given a hint to cache a statement across page requests.
> > > > >>>>>
> > > > >>>>> Julian
> > > > >>>>>
> > > > >>>>> [1]
> > > > >>>>>
> > > >
> > https://docs.oracle.com/cd/B10500_01/server.920/a96533/hintsref.htm#4924
> > > > >>>>>
> > > > >>>>>> On Oct 24, 2018, at 11:19 AM, Christian Beikov <
> > > > >>>>> christian.beikov@xxxxxxxxx> wrote:
> > > > >>>>>> Hey,
> > > > >>>>>>
> > > > >>>>>> not sure if this should be an SQL keyword. JDBC specifies
> > various
> > > > >>>>> constants that can be used at statement creation time:
> > > > >>>>>
> > https://docs.oracle.com/javase/7/docs/api/java/sql/ResultSet.html
> > > > >>>>>> Not sure though if or how these configurations are accessible
> > for
> > > > data
> > > > >>>>> stores or dialects, but IMO using these would be the proper
> way.
> > > > >>>>>> Regards
> > > > >>>>>>
> > > > >>>>>> Christian
> > > > >>>>>>
> > > > >>>>>>> Am 24.10.2018 um 18:44 schrieb Andrei Sereda:
> > > > >>>>>>> Hello,
> > > > >>>>>>>
> > > > >>>>>>>
> > > > >>>>>>> I was thinking about adding [scrolling functionality](
> > > > >>>>>>>
> > > > >>>>>
> > > >
> >
> https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-scroll.html
> > > > >>>>> )
> > > > >>>>>>> to elastic search adapter. Since scrolling has non-negligible
> > > > effect on
> > > > >>>>> the
> > > > >>>>>>> cluster it should be selectively enabled on per query basis.
> > So,
> > > > likely,
> > > > >>>>>>> user has to explicitly set "scroll flag" somewhere.
> > > > >>>>>>>
> > > > >>>>>>> Most natural way seems in SQL. [Calcite sql grammar](
> > > > >>>>>>> https://calcite.apache.org/docs/reference.html) has `SCROLL`
> > > > keyword
> > > > >>>>>>> (unused to my knowledge). There were also discussions about
> > adding
> > > > >>>>> hints to
> > > > >>>>>>> Calcite.
> > > > >>>>>>>
> > > > >>>>>>> ### Examples
> > > > >>>>>>> ```sql
> > > > >>>>>>> -- special sql keyword ?
> > > > >>>>>>> SCROLL select * from elastic;
> > > > >>>>>>>
> > > > >>>>>>> -- assuming hints are available in calcite
> > > > >>>>>>> /* HINT: scroll */ select * from elastic;
> > > > >>>>>>> ```
> > > > >>>>>>>
> > > > >>>>>>> What people think about this use-case ? Are there better
> ideas
> > ?
> > > > >>>>>>>
> > > > >>>>>>> Regards,
> > > > >>>>>>> Andrei.
> > > > >>>>>>>
> > > >
> > > >
> >
>