Re: [druid-dev] 'taskDuration' in Kafka Indexing Service causes more problems than it solves
The 'taskDuration' configuration is necessary because otherwise the Kafka
indexing task would not know when it should stop and publish the segments.
Running without this configuration would only be viable if you were working
with a stream that never had late data for any reason, and late data is
something that real systems have to deal with. It would also prevent you from
reading historical data, which is a supported scenario for the Kafka indexing
service.
As an example of what happens without it, let's say you have a task that is
receiving events that all fall into the 2018-04-01T01:00/2018-04-01T02:00
segment. How would you know when that segment is done and the task can
complete? Do you wait for the first event with a timestamp >
2018-04-01T02:00? Do you close the segment X minutes after the server time
moves into the next hour? What happens if you receive data that falls into
that segment later on? Do you create a new task to read that message, and
when does that task complete?
Similarly for historical data, if we had a stream that contained events
with timestamps from a year ago, how would the task know that it is done?
It is because of these kinds of scenarios that the lifetime of a task can't be
coupled to event timestamps (which are what determine the segment intervals).
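To make that concrete, here is a minimal Python sketch (not Druid code; the
function names and the sample events are made up for illustration). It maps
event timestamps to hourly segment intervals and flags events that arrive
after a newer interval has already been seen, which is exactly the case where
a "close the segment when the next hour starts" rule would break down.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def hourly_interval(ts: datetime) -> Tuple[datetime, datetime]:
    """Return the [start, end) hourly segment interval that contains ts."""
    start = ts.replace(minute=0, second=0, microsecond=0)
    return start, start + timedelta(hours=1)


# Hypothetical stream in arrival order: (offset, event timestamp).
events = [
    (0, datetime(2018, 4, 1, 1, 5)),
    (1, datetime(2018, 4, 1, 1, 50)),
    (2, datetime(2018, 4, 1, 2, 3)),   # first event past 02:00
    (3, datetime(2018, 4, 1, 1, 59)),  # late event for 01:00/02:00
    (4, datetime(2017, 4, 1, 1, 0)),   # historical event, a year old
]


def late_arrivals(stream: List[Tuple[int, datetime]]) -> List[int]:
    """Offsets of events whose interval is older than the newest one seen so far."""
    newest_start = None
    late = []
    for offset, ts in stream:
        start, _ = hourly_interval(ts)
        if newest_start is not None and start < newest_start:
            late.append(offset)
        else:
            newest_start = start
    return late


print(late_arrivals(events))  # [3, 4]
```

Offsets 3 and 4 both belong to intervals that a timestamp-driven rule would
already have closed, so the task cannot use event time alone to decide when it
is finished.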
Like Parag and Jihoon mentioned, incremental handoff should help, but you
can still get sub-optimal segments if you have late data. Automated
compaction (https://github.com/druid-io/druid/pull/5102) should help with
that.
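For reference, the splitting described in the original post (quoted below) can
be reproduced with a small Python sketch. It is a simplified model, not Druid
code: it assumes event time tracks wall-clock time, that tasks run back to back
for exactly 'taskDuration', and that each task publishes its own piece of every
hour it overlaps. The function name and parameters are hypothetical.

```python
from datetime import datetime, timedelta


def pieces(submitted, task_duration, num_tasks, granularity=timedelta(hours=1)):
    """Split consecutive task windows at segment-granularity boundaries.

    Assumes granularity boundaries are aligned to midnight (true for HOUR).
    """
    out = []
    task_start = submitted
    for _ in range(num_tasks):
        task_end = task_start + task_duration
        cursor = task_start
        while cursor < task_end:
            midnight = cursor.replace(hour=0, minute=0, second=0, microsecond=0)
            buckets_elapsed = (cursor - midnight) // granularity
            bucket_end = midnight + (buckets_elapsed + 1) * granularity
            # A piece ends at the hour boundary or at the task's end,
            # whichever comes first.
            piece_end = min(bucket_end, task_end)
            out.append((cursor, piece_end))
            cursor = piece_end
        task_start = task_end
    return out


def fmt(ranges):
    return [f"{a:%H:%M}-{b:%H:%M}" for a, b in ranges]


submitted = datetime(2018, 4, 1, 1, 15)
print(fmt(pieces(submitted, timedelta(hours=1), 2)))
# ['01:15-02:00', '02:00-02:15', '02:15-03:00', '03:00-03:15']
print(fmt(pieces(submitted, timedelta(hours=3), 1)))
# ['01:15-02:00', '02:00-03:00', '03:00-04:00', '04:00-04:15']
```

Under these assumptions, a longer task duration leaves the hours in the middle
of a task whole, but the hours at each task boundary are still split.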
On Thu, Apr 5, 2018 at 2:06 PM, Jihoon Son <jihoonson@xxxxxxxxxx> wrote:
> Hi Prashant,
> Kafka indexing service is still an experimental feature and is being
> actively developed. There might be some issues because of the fast and
> active development. I think 'taskDuration' is one such issue.
> When the Kafka indexing service was first introduced, it was not capable of
> incremental handoff, and 'taskDuration' was the only way to publish segments
> generated by Kafka index tasks.
> Incremental handoff of Kafka index tasks is a fairly new feature introduced
> in 0.12.0, which is our latest release. So, there might be some more issues,
> including the one you pointed out.
> The issue of 'taskDuration' might be solved by running Kafka index tasks
> forever instead of starting new tasks every 'taskDuration'. To do so, as Parag
> pointed out, some issues should be addressed first, such as uploading task
> logs incrementally. Do you have any ideas for fixing this issue? I will
> happily help you.
> On Thu, Apr 5, 2018 at 10:35 AM, Prashant Deva <prashant.deva@xxxxxxxxx> wrote:
> > Setting it to 3x doesn't solve the problem. It just delays it. In this case
> > the 3rd segment created would be split into uneven parts.
> > On Thu, Apr 5, 2018 at 10:31 AM Parag Jain <pjain11@xxxxxxxx> wrote:
> >> You can set the task duration to something like 3-5X the segment
> >> granularity to avoid this problem. In an ideal world, when task logs can
> >> be uploaded incrementally to some long-term store, tasks should actually
> >> run forever and get replaced only when they fail.
> >> On Wednesday, April 4, 2018, 10:26:35 PM CDT, Prashant Deva <
> >> prashant.deva@xxxxxxxxx> wrote:
> >> I believe the 'taskDuration' field for Kafka Indexing Service does more
> >> harm than good. Here is why:
> >> Let's assume that I am creating hourly segments ("segmentGranularity":
> >> "HOUR").
> >> If I am ingesting via Kafka, I typically want to persist a segment:
> >> 1. When segmentGranularity is reached. That is, every hour.
> >> Thus I want my segments to look like :
> >> 01:00-02:00
> >> 02:00-03:00
> >> 2. If a segment is getting too large, then I want to split it even if
> >> segment granularity hasn't been reached. "maxRowsPerSegment" achieves
> >> this.
> >> Now there is the `taskDuration` field. I set it to "taskDuration":
> >> assuming this will result in the segments I want above. However, it turns
> >> out that is not the case!
> >> Unless I submit my supervisor spec at EXACTLY 01:00, the segments will
> >> now be created every hour from when I SUBMITTED my supervisor spec.
> >> So instead of segments like:
> >> 01:00-02:00
> >> 02:00-03:00
> >> I now get segments like:
> >> 01:15-02:00
> >> 02:00-02:15
> >> 02:15-03:00
> >> The segments are now broken not just by segment granularity but also by
> >> 'taskDuration'.
> >> This is not what I wanted! Now the segments for every hour are split
> >> into at least 2 non-optimally sized segments.
> >> The segments created shouldn't be dependent on when the supervisor spec
> >> was submitted.
> >> The 'taskDuration' field thus is not only not necessary for Kafka
> >> Indexing Service, it actually results in unwanted behavior.
> >> Note: I tried cross-posting to the Apache mailing list. However, the Druid
> >> page on the Apache website (http://incubator.apache.org/projects/druid.html)
> >> has no instructions on how to actually view or subscribe to the list.
> >> --
> >> You received this message because you are subscribed to the Google
> >> "Druid Development" group.
> >> To unsubscribe from this group and stop receiving emails from it, send
> >> email to druid-development+unsubscribe@xxxxxxxxxxxxxxxx.
> >> To post to this group, send email to druid-development@xxxxxxxxxxxxxxxx
> >> To view this discussion on the web visit
> >> https://groups.google.com/d/msgid/druid-development/0e680eed
> >> <https://groups.google.com/d/msgid/druid-development/0e680ee
> >> .
> >> For more options, visit https://groups.google.com/d/optout.
> > --
> > Prashant