git.net

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: scrolling in ElasticSearch adapter


What I was saying is that scrolling is the only way to ensure you get
correct results coming back from Elasticsearch if you are going to do more
processing.

https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-scroll.html

As long as you delete the scroll at the end its not the end of the world.
There is no deep paging unless you use scrolling or a search after which is
basically the same idea as scrolling
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-search-after.html
.

I would say that scrolling should be the default behavior and then should
look at optimizing for aggregations, smaller result sets, etc. If you look
at scrolling as an optimization the default results won't be correct (ie:
only returning 10 results when there are a lot more). For default
pagination there is a limit of 10000 results anyway unless you use
scrolling or search after.

Kevin Risden


On Thu, Oct 25, 2018 at 3:33 PM Andrei Sereda <andrei@xxxxxxxxx> wrote:

> Hi Kevin,
>
> You suggest to use scrolling for all elastic queries ? Even when there
> are predicates ?  Some questions :
>
> 1) Scrolling has a runtime overhead for elastic cluster. Having it
> enabled by default (against vendor recommendation) is risky. Does it
> (not) cause issues in Solr ?
> 2) Scrolling does not work with aggregates, which means I'll have to
> fallback to pagination (for whole result set).
> 3) When result set is small scrolling / pagination is not necessary.
>
> Having said that I don't see any way to automatically enable / disable
> scrolling based on RelTree (pls let me know if there is one).
>
> You seem to agree with Stamatis that `select * from elastic` (1)
> should return all documents (rows) not just 10 (ES default). If so,
> one either has to use pagination or scrolling.
> What about queries `select count(*), foo from elastic group by foo`
> (2) (term aggregation). In this case only pagination works. If calcite
> returns all rows for (1) it should do the same for (2).  Should one
> use pagination with aggregates (2) and scrolling for simple queries
> (1) ?
> What if user specifies a high `LIMIT` (eg 10M) should we batch the result ?
> Should user explicitly enable scrolling (eg. JDBC api) ? If so, should
> one throw an exception when scrolling and aggregation is used or
> quietly default to pagination ?
>
> Thanks everybody for your input, it helps.
>
> Regards,
> Andrei.
>
>
> On Thu, Oct 25, 2018 at 2:08 PM
> Kevin Risden
> <krisden@xxxxxxxxxx> wrote:
> >
> > >
> > > There is one more “issue”. Currently select * from elastic returns at
> most
> > > 10 rows (in calcite). This is consistent with elastic behaviour which
> > > limits result set to 10 documents (unless size is specified).
> >
> >
> > In Solr land for the Calcite integration it uses the /export handler or
> > streaming expressions when there is no limit specified (these are just
> like
> > scrolling or called CursorMark in Solr). It falls back to /select with
> the
> > default limit of 10 rows.
> >
> > I think that scroll should be the default almost always since it is the
> > only way to return all the results. Even with a limit = 100000 would
> cause
> > problems with the default query handler. Paging deep into the result set
> is
> > not something that Lucene (Elasticsearch or Solr) does well.
> >
> >
> > Kevin Risden
> >
> >
> > On Thu, Oct 25, 2018 at 1:31 PM Julian Hyde <jhyde@xxxxxxxxxx> wrote:
> >
> > > Do you need to generate a different plan (i.e. a different tree of
> > > RelNodes) for scrolling vs non-scrolling? If so, it’s certainly
> > > inconvenient that you don’t know until execute time whether they want
> > > scrolling. A possible solution would be to generate TWO plans - one
> > > scrolling, one non-scrolling inside the prepared statement - and pick
> which
> > > one based on the runtime context.
> > >
> > > Julian
> > >
> > > > On Oct 25, 2018, at 1:42 AM, Christian Beikov <
> > > christian.beikov@xxxxxxxxx> wrote:
> > > >
> > > > Hey Andrei,
> > > >
> > > > I don't have an answer for how you can access these settings from
> within
> > > the adapter nor how one could do that via RelNodes but the suggestion
> to
> > > use DataContext for that purpose sounds reasonable. Maybe someone else
> has
> > > an idea?
> > > >
> > > > Anyway, since these are settings that don't affect the general
> semantics
> > > of the query/statement and also usually require a special API to be
> used,
> > > I'd rather see these aspects not end up in the query string.
> > > >
> > > > Am 25.10.2018 um 02:15 schrieb Andrei Sereda:
> > > >> Christian,
> > > >>
> > > >> I like TYPE_SCROLL_INSENSITIVE / fetchSize in PreparedStatement
> > > >> generally but have some reservations (questions) :
> > > >>
> > > >> How to pass resultSetType / fetchSize from PreparedStatement to
> > > RelNodes ?
> > > >> What if user doesn’t use JDBC (eg. RelBuilders) ?
> > > >> On Wed, Oct 24, 2018 at 6:28 PM Christian Beikov
> > > >> <christian.beikov@xxxxxxxxx> wrote:
> > > >>> In JDBC one can configure a fetch size which would reflect the
> amount
> > > of
> > > >>> rows to be fetched initially, but also subsequently.
> > > >>>
> > >
> https://docs.oracle.com/javase/7/docs/api/java/sql/Statement.html#setFetchSize(int)
> > > >>>
> > > >>> According to what you are writing, ES behvior is what
> > > >>> TYPE_SCROLL_INSENSITIVE would do i.e. provide a snapshot view that
> > > isn't
> > > >>> affected by changes.
> > > >>>
> > > >>> IMO TYPE_SCROLL_SENSITIVE means that if you have rows R1, R2, R3,
> R4,
> > > >>> ... and view R1, R2, then R3 is deleted and you fetch the next
> rows,
> > > you
> > > >>> wouldn't see R3.
> > > >>>
> > > >>> According to the JDBC spec
> > > >>> (
> > >
> https://docs.oracle.com/javase/7/docs/api/java/sql/Connection.html#prepareStatement(java.lang.String,%20int,%20int)
> > > >>> ) you don't have to support all modes. Usually, user code doesn't
> use
> > > >>> scrolling that much, but at least forward scrolling makes sense.
> > > >>>
> > > >>> Am 24.10.2018 um 21:38 schrieb Andrei Sereda:
> > > >>>> Hi Julian,
> > > >>>>
> > > >>>> Scrolling (in elastic) does not only mean “open a cursor” but also
> > > iterate
> > > >>>> over consistent snapshot. From docs:
> > > >>>>
> > > >>>> The results that are returned from a scroll request reflect the
> state
> > > of
> > > >>>> the index at the time that the initial search request was made,
> like a
> > > >>>> snapshot in time. Subsequent changes to documents (index, update
> or
> > > delete)
> > > >>>> will only affect later search requests.
> > > >>>>
> > > >>>> So pagination (fetch / offset) can’t exactly replicate this
> > > functionality.
> > > >>>>
> > > >>>> The problem with scrolling (in elastic) is that it is expensive
> and
> > > can’t
> > > >>>> (shouldn’t) be enabled it by default.
> > > >>>>
> > > >>>> There is one more “issue”. Currently select * from elastic
> returns at
> > > most
> > > >>>> 10 rows (in calcite). This is consistent with elastic behaviour
> which
> > > >>>> limits result set to 10 documents (unless size is specified). When
> > > >>>> returning a cursor (eg. using JDBC TYPE_SCROLL_SENSITIVE
> > > >>>> <
> > >
> https://docs.oracle.com/javase/7/docs/api/java/sql/ResultSet.html#TYPE_SCROLL_SENSITIVE
> > > >
> > > >>>> or SQL hint) does it mean return whole elastic index ? I’m not at
> > > ease with
> > > >>>> returning different results based on hints or cursor settings.
> > > >>>>
> > > >>>> Andrei.
> > > >>>>
> > > >>>> On Wed, Oct 24, 2018 at 3:02 PM Julian Hyde <
> jhyde.apache@xxxxxxxxx>
> > > wrote:
> > > >>>>
> > > >>>>> It seems to me that Elasticsearch scroll means return a cursor -
> a
> > > >>>>> collection of rows that you iterate over, and you may not read
> all
> > > of them.
> > > >>>>> This is the default operation of JDBC.
> > > >>>>>
> > > >>>>> So, I guess we need to give the user a way to signal their
> intent to
> > > read
> > > >>>>> all rows versus only the first few. Oracle’s FIRST_ROWS and
> ALL_ROWS
> > > >>>>> hints[1] seem close to this. We would want the hints to be acted
> > > upon by
> > > >>>>> both the optimizer and the JDBC transport.
> > > >>>>>
> > > >>>>> Related is pagination. SQL has FETCH and OFFSET, which allow you
> to
> > > >>>>> retrieve different pieces of a large result set in separate
> > > statements or
> > > >>>>> (using query parameters) executions. It would be useful if the
> > > server could
> > > >>>>> be given a hint to cache a statement across page requests.
> > > >>>>>
> > > >>>>> Julian
> > > >>>>>
> > > >>>>> [1]
> > > >>>>>
> > >
> https://docs.oracle.com/cd/B10500_01/server.920/a96533/hintsref.htm#4924
> > > >>>>>
> > > >>>>>> On Oct 24, 2018, at 11:19 AM, Christian Beikov <
> > > >>>>> christian.beikov@xxxxxxxxx> wrote:
> > > >>>>>> Hey,
> > > >>>>>>
> > > >>>>>> not sure if this should be an SQL keyword. JDBC specifies
> various
> > > >>>>> constants that can be used at statement creation time:
> > > >>>>>
> https://docs.oracle.com/javase/7/docs/api/java/sql/ResultSet.html
> > > >>>>>> Not sure though if or how these configurations are accessible
> for
> > > data
> > > >>>>> stores or dialects, but IMO using these would be the proper way.
> > > >>>>>> Regards
> > > >>>>>>
> > > >>>>>> Christian
> > > >>>>>>
> > > >>>>>>> Am 24.10.2018 um 18:44 schrieb Andrei Sereda:
> > > >>>>>>> Hello,
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>>> I was thinking about adding [scrolling functionality](
> > > >>>>>>>
> > > >>>>>
> > >
> https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-scroll.html
> > > >>>>> )
> > > >>>>>>> to elastic search adapter. Since scrolling has non-negligible
> > > effect on
> > > >>>>> the
> > > >>>>>>> cluster it should be selectively enabled on per query basis.
> So,
> > > likely,
> > > >>>>>>> user has to explicitly set "scroll flag" somewhere.
> > > >>>>>>>
> > > >>>>>>> Most natural way seems in SQL. [Calcite sql grammar](
> > > >>>>>>> https://calcite.apache.org/docs/reference.html) has `SCROLL`
> > > keyword
> > > >>>>>>> (unused to my knowledge). There were also discussions about
> adding
> > > >>>>> hints to
> > > >>>>>>> Calcite.
> > > >>>>>>>
> > > >>>>>>> ### Examples
> > > >>>>>>> ```sql
> > > >>>>>>> -- special sql keyword ?
> > > >>>>>>> SCROLL select * from elastic;
> > > >>>>>>>
> > > >>>>>>> -- assuming hints are available in calcite
> > > >>>>>>> /* HINT: scroll */ select * from elastic;
> > > >>>>>>> ```
> > > >>>>>>>
> > > >>>>>>> What people think about this use-case ? Are there better ideas
> ?
> > > >>>>>>>
> > > >>>>>>> Regards,
> > > >>>>>>> Andrei.
> > > >>>>>>>
> > >
> > >
>