git.net

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: scrolling in ElasticSearch adapter


#1 sounds reasonable.
#2 will have to see what that means for actual aggregations. Sounds like it
could work.
#3 not sure what is meant here. I think fetch size is a hint and not
required. I wouldn't default to 10k though.

Kevin Risden


On Thu, Oct 25, 2018 at 4:25 PM Andrei Sereda <andrei@xxxxxxxxx> wrote:

> There is new Composite Aggregation
> <
> https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-composite-aggregation.html
> >
> (still in beta) which allows pagination with aggregates. It is available on
> versions >= 5.6 (unfortunately we have to support 2.6+).
>
> To sum up:
>
>    1. For non-aggregate queries always use scrolling or search-after by
>    default (not pagination) .
>    2. For aggregates use pagination. Either Terms Aggregation (with
>    partitioning) or Composite Aggregation.
>    3. If possible Retrieve batch size from JDBC api (known as
>    Statement.setFetchSize(int)). If not default to 10k ?
>
> Does it sound reasonable ?
>
>
> On Thu, Oct 25, 2018 at 3:43 PM
> Kevin Risden
> <krisden@xxxxxxxxxx> wrote:
>
> > What I was saying is that scrolling is the only way to ensure you get
> > correct results coming back from Elasticsearch if you are going to do
> more
> > processing.
> >
> >
> >
> https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-scroll.html
> >
> > As long as you delete the scroll at the end its not the end of the world.
> > There is no deep paging unless you use scrolling or a search after which
> is
> > basically the same idea as scrolling
> >
> >
> https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-search-after.html
> > .
> >
> > I would say that scrolling should be the default behavior and then should
> > look at optimizing for aggregations, smaller result sets, etc. If you
> look
> > at scrolling as an optimization the default results won't be correct (ie:
> > only returning 10 results when there are a lot more). For default
> > pagination there is a limit of 10000 results anyway unless you use
> > scrolling or search after.
> >
> > Kevin Risden
> >
> >
> > On Thu, Oct 25, 2018 at 3:33 PM Andrei Sereda <andrei@xxxxxxxxx> wrote:
> >
> > > Hi Kevin,
> > >
> > > You suggest to use scrolling for all elastic queries ? Even when there
> > > are predicates ?  Some questions :
> > >
> > > 1) Scrolling has a runtime overhead for elastic cluster. Having it
> > > enabled by default (against vendor recommendation) is risky. Does it
> > > (not) cause issues in Solr ?
> > > 2) Scrolling does not work with aggregates, which means I'll have to
> > > fallback to pagination (for whole result set).
> > > 3) When result set is small scrolling / pagination is not necessary.
> > >
> > > Having said that I don't see any way to automatically enable / disable
> > > scrolling based on RelTree (pls let me know if there is one).
> > >
> > > You seem to agree with Stamatis that `select * from elastic` (1)
> > > should return all documents (rows) not just 10 (ES default). If so,
> > > one either has to use pagination or scrolling.
> > > What about queries `select count(*), foo from elastic group by foo`
> > > (2) (term aggregation). In this case only pagination works. If calcite
> > > returns all rows for (1) it should do the same for (2).  Should one
> > > use pagination with aggregates (2) and scrolling for simple queries
> > > (1) ?
> > > What if user specifies a high `LIMIT` (eg 10M) should we batch the
> > result ?
> > > Should user explicitly enable scrolling (eg. JDBC api) ? If so, should
> > > one throw an exception when scrolling and aggregation is used or
> > > quietly default to pagination ?
> > >
> > > Thanks everybody for your input, it helps.
> > >
> > > Regards,
> > > Andrei.
> > >
> > >
> > > On Thu, Oct 25, 2018 at 2:08 PM
> > > Kevin Risden
> > > <krisden@xxxxxxxxxx> wrote:
> > > >
> > > > >
> > > > > There is one more “issue”. Currently select * from elastic returns
> at
> > > most
> > > > > 10 rows (in calcite). This is consistent with elastic behaviour
> which
> > > > > limits result set to 10 documents (unless size is specified).
> > > >
> > > >
> > > > In Solr land for the Calcite integration it uses the /export handler
> or
> > > > streaming expressions when there is no limit specified (these are
> just
> > > like
> > > > scrolling or called CursorMark in Solr). It falls back to /select
> with
> > > the
> > > > default limit of 10 rows.
> > > >
> > > > I think that scroll should be the default almost always since it is
> the
> > > > only way to return all the results. Even with a limit = 100000 would
> > > cause
> > > > problems with the default query handler. Paging deep into the result
> > set
> > > is
> > > > not something that Lucene (Elasticsearch or Solr) does well.
> > > >
> > > >
> > > > Kevin Risden
> > > >
> > > >
> > > > On Thu, Oct 25, 2018 at 1:31 PM Julian Hyde <jhyde@xxxxxxxxxx>
> wrote:
> > > >
> > > > > Do you need to generate a different plan (i.e. a different tree of
> > > > > RelNodes) for scrolling vs non-scrolling? If so, it’s certainly
> > > > > inconvenient that you don’t know until execute time whether they
> want
> > > > > scrolling. A possible solution would be to generate TWO plans - one
> > > > > scrolling, one non-scrolling inside the prepared statement - and
> pick
> > > which
> > > > > one based on the runtime context.
> > > > >
> > > > > Julian
> > > > >
> > > > > > On Oct 25, 2018, at 1:42 AM, Christian Beikov <
> > > > > christian.beikov@xxxxxxxxx> wrote:
> > > > > >
> > > > > > Hey Andrei,
> > > > > >
> > > > > > I don't have an answer for how you can access these settings from
> > > within
> > > > > the adapter nor how one could do that via RelNodes but the
> suggestion
> > > to
> > > > > use DataContext for that purpose sounds reasonable. Maybe someone
> > else
> > > has
> > > > > an idea?
> > > > > >
> > > > > > Anyway, since these are settings that don't affect the general
> > > semantics
> > > > > of the query/statement and also usually require a special API to be
> > > used,
> > > > > I'd rather see these aspects not end up in the query string.
> > > > > >
> > > > > > Am 25.10.2018 um 02:15 schrieb Andrei Sereda:
> > > > > >> Christian,
> > > > > >>
> > > > > >> I like TYPE_SCROLL_INSENSITIVE / fetchSize in PreparedStatement
> > > > > >> generally but have some reservations (questions) :
> > > > > >>
> > > > > >> How to pass resultSetType / fetchSize from PreparedStatement to
> > > > > RelNodes ?
> > > > > >> What if user doesn’t use JDBC (eg. RelBuilders) ?
> > > > > >> On Wed, Oct 24, 2018 at 6:28 PM Christian Beikov
> > > > > >> <christian.beikov@xxxxxxxxx> wrote:
> > > > > >>> In JDBC one can configure a fetch size which would reflect the
> > > amount
> > > > > of
> > > > > >>> rows to be fetched initially, but also subsequently.
> > > > > >>>
> > > > >
> > >
> >
> https://docs.oracle.com/javase/7/docs/api/java/sql/Statement.html#setFetchSize(int)
> > > > > >>>
> > > > > >>> According to what you are writing, ES behvior is what
> > > > > >>> TYPE_SCROLL_INSENSITIVE would do i.e. provide a snapshot view
> > that
> > > > > isn't
> > > > > >>> affected by changes.
> > > > > >>>
> > > > > >>> IMO TYPE_SCROLL_SENSITIVE means that if you have rows R1, R2,
> R3,
> > > R4,
> > > > > >>> ... and view R1, R2, then R3 is deleted and you fetch the next
> > > rows,
> > > > > you
> > > > > >>> wouldn't see R3.
> > > > > >>>
> > > > > >>> According to the JDBC spec
> > > > > >>> (
> > > > >
> > >
> >
> https://docs.oracle.com/javase/7/docs/api/java/sql/Connection.html#prepareStatement(java.lang.String,%20int,%20int)
> > > > > >>> ) you don't have to support all modes. Usually, user code
> doesn't
> > > use
> > > > > >>> scrolling that much, but at least forward scrolling makes
> sense.
> > > > > >>>
> > > > > >>> Am 24.10.2018 um 21:38 schrieb Andrei Sereda:
> > > > > >>>> Hi Julian,
> > > > > >>>>
> > > > > >>>> Scrolling (in elastic) does not only mean “open a cursor” but
> > also
> > > > > iterate
> > > > > >>>> over consistent snapshot. From docs:
> > > > > >>>>
> > > > > >>>> The results that are returned from a scroll request reflect
> the
> > > state
> > > > > of
> > > > > >>>> the index at the time that the initial search request was
> made,
> > > like a
> > > > > >>>> snapshot in time. Subsequent changes to documents (index,
> update
> > > or
> > > > > delete)
> > > > > >>>> will only affect later search requests.
> > > > > >>>>
> > > > > >>>> So pagination (fetch / offset) can’t exactly replicate this
> > > > > functionality.
> > > > > >>>>
> > > > > >>>> The problem with scrolling (in elastic) is that it is
> expensive
> > > and
> > > > > can’t
> > > > > >>>> (shouldn’t) be enabled it by default.
> > > > > >>>>
> > > > > >>>> There is one more “issue”. Currently select * from elastic
> > > returns at
> > > > > most
> > > > > >>>> 10 rows (in calcite). This is consistent with elastic
> behaviour
> > > which
> > > > > >>>> limits result set to 10 documents (unless size is specified).
> > When
> > > > > >>>> returning a cursor (eg. using JDBC TYPE_SCROLL_SENSITIVE
> > > > > >>>> <
> > > > >
> > >
> >
> https://docs.oracle.com/javase/7/docs/api/java/sql/ResultSet.html#TYPE_SCROLL_SENSITIVE
> > > > > >
> > > > > >>>> or SQL hint) does it mean return whole elastic index ? I’m not
> > at
> > > > > ease with
> > > > > >>>> returning different results based on hints or cursor settings.
> > > > > >>>>
> > > > > >>>> Andrei.
> > > > > >>>>
> > > > > >>>> On Wed, Oct 24, 2018 at 3:02 PM Julian Hyde <
> > > jhyde.apache@xxxxxxxxx>
> > > > > wrote:
> > > > > >>>>
> > > > > >>>>> It seems to me that Elasticsearch scroll means return a
> cursor
> > -
> > > a
> > > > > >>>>> collection of rows that you iterate over, and you may not
> read
> > > all
> > > > > of them.
> > > > > >>>>> This is the default operation of JDBC.
> > > > > >>>>>
> > > > > >>>>> So, I guess we need to give the user a way to signal their
> > > intent to
> > > > > read
> > > > > >>>>> all rows versus only the first few. Oracle’s FIRST_ROWS and
> > > ALL_ROWS
> > > > > >>>>> hints[1] seem close to this. We would want the hints to be
> > acted
> > > > > upon by
> > > > > >>>>> both the optimizer and the JDBC transport.
> > > > > >>>>>
> > > > > >>>>> Related is pagination. SQL has FETCH and OFFSET, which allow
> > you
> > > to
> > > > > >>>>> retrieve different pieces of a large result set in separate
> > > > > statements or
> > > > > >>>>> (using query parameters) executions. It would be useful if
> the
> > > > > server could
> > > > > >>>>> be given a hint to cache a statement across page requests.
> > > > > >>>>>
> > > > > >>>>> Julian
> > > > > >>>>>
> > > > > >>>>> [1]
> > > > > >>>>>
> > > > >
> > >
> https://docs.oracle.com/cd/B10500_01/server.920/a96533/hintsref.htm#4924
> > > > > >>>>>
> > > > > >>>>>> On Oct 24, 2018, at 11:19 AM, Christian Beikov <
> > > > > >>>>> christian.beikov@xxxxxxxxx> wrote:
> > > > > >>>>>> Hey,
> > > > > >>>>>>
> > > > > >>>>>> not sure if this should be an SQL keyword. JDBC specifies
> > > various
> > > > > >>>>> constants that can be used at statement creation time:
> > > > > >>>>>
> > > https://docs.oracle.com/javase/7/docs/api/java/sql/ResultSet.html
> > > > > >>>>>> Not sure though if or how these configurations are
> accessible
> > > for
> > > > > data
> > > > > >>>>> stores or dialects, but IMO using these would be the proper
> > way.
> > > > > >>>>>> Regards
> > > > > >>>>>>
> > > > > >>>>>> Christian
> > > > > >>>>>>
> > > > > >>>>>>> Am 24.10.2018 um 18:44 schrieb Andrei Sereda:
> > > > > >>>>>>> Hello,
> > > > > >>>>>>>
> > > > > >>>>>>>
> > > > > >>>>>>> I was thinking about adding [scrolling functionality](
> > > > > >>>>>>>
> > > > > >>>>>
> > > > >
> > >
> >
> https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-scroll.html
> > > > > >>>>> )
> > > > > >>>>>>> to elastic search adapter. Since scrolling has
> non-negligible
> > > > > effect on
> > > > > >>>>> the
> > > > > >>>>>>> cluster it should be selectively enabled on per query
> basis.
> > > So,
> > > > > likely,
> > > > > >>>>>>> user has to explicitly set "scroll flag" somewhere.
> > > > > >>>>>>>
> > > > > >>>>>>> Most natural way seems in SQL. [Calcite sql grammar](
> > > > > >>>>>>> https://calcite.apache.org/docs/reference.html) has
> `SCROLL`
> > > > > keyword
> > > > > >>>>>>> (unused to my knowledge). There were also discussions about
> > > adding
> > > > > >>>>> hints to
> > > > > >>>>>>> Calcite.
> > > > > >>>>>>>
> > > > > >>>>>>> ### Examples
> > > > > >>>>>>> ```sql
> > > > > >>>>>>> -- special sql keyword ?
> > > > > >>>>>>> SCROLL select * from elastic;
> > > > > >>>>>>>
> > > > > >>>>>>> -- assuming hints are available in calcite
> > > > > >>>>>>> /* HINT: scroll */ select * from elastic;
> > > > > >>>>>>> ```
> > > > > >>>>>>>
> > > > > >>>>>>> What people think about this use-case ? Are there better
> > ideas
> > > ?
> > > > > >>>>>>>
> > > > > >>>>>>> Regards,
> > > > > >>>>>>> Andrei.
> > > > > >>>>>>>
> > > > >
> > > > >
> > >
> >
>