Re: [druid-dev] 'taskDuration' in Kafka Indexing Service causes more problems than it solves


Hi Prashant,

The Kafka indexing service is still an experimental feature and is under
active development. That pace of development can introduce issues, and I
think 'taskDuration' is one of them.

When the Kafka indexing service was first introduced, it did not support
incremental handoff, and 'taskDuration' was the only way to publish segments
generated by Kafka index tasks.

Incremental handoff for Kafka index tasks is a fairly new feature, introduced
in 0.12.0, which is our latest release. So there may be more issues,
including the one you pointed out.

The 'taskDuration' issue might be solved by running Kafka index tasks
forever instead of starting new tasks every 'taskDuration'. To do so, as Parag
pointed out, some issues need to be addressed first, such as uploading task
logs incrementally. Do you have any ideas for fixing this? I will happily
help you.

Jihoon

On Thu, Apr 5, 2018 at 10:35 AM, Prashant Deva <prashant.deva@xxxxxxxxx> wrote:

> Setting it to 3x doesn't solve the problem. It just delays it. In this case
> the 3rd segment created would be split into uneven parts.
>
>
>
> On Thu, Apr 5, 2018 at 10:31 AM Parag Jain <pjain11@xxxxxxxx> wrote:
>
>> You can set the task duration to something like 3-5x the segment
>> granularity to avoid this problem. In an ideal world, where task logs can
>> also be uploaded incrementally to some long-term store, tasks would
>> actually run forever and be replaced only when they fail.
>>
>>
>> On Wednesday, April 4, 2018, 10:26:35 PM CDT, Prashant Deva <
>> prashant.deva@xxxxxxxxx> wrote:
>>
>>
>> I believe the 'taskDuration' field for the Kafka Indexing Service does more
>> harm than good. Here is why:
>>
>> Let's assume that I am creating hourly segments ("segmentGranularity":
>> "HOUR").
>>
>>
>> If I am ingesting via Kafka, I typically want to persist a segment:
>>
>> 1. When the segmentGranularity boundary is reached. That is, every hour.
>> Thus I want my segments to look like:
>> 01:00-02:00
>> 02:00-03:00
>>
>>
>> 2. If a segment is getting too large, then I want to split it even if the
>> segment granularity hasn't been reached. "maxRowsPerSegment" achieves this.
>>
>>
>> Now there is the `taskDuration` field. I set it to "taskDuration": "PT1H",
>> assuming this would result in the segments I want above. However, it turns
>> out that is not the case!
>>
>> Unless I submit my supervisor spec at EXACTLY 01:00, the segments will
>> now be created every hour from when I SUBMITTED my supervisor spec.
>>
>> So instead of segments like:
>> 01:00-02:00
>> 02:00-03:00
>>
>> I now get segments like:
>> 01:15-02:00
>> 02:00-02:15
>> 02:15-03:00
>>
>> The segments are now broken up not just by segment granularity but also by
>> 'taskDuration'.
>> This is not what I wanted! Now the segments for every hour are split into
>> at least 2 suboptimally sized segments.
>>
>>
>> The segments created shouldn't be dependent on when the supervisor spec
>> was submitted.
>> The 'taskDuration' field thus is not only not necessary for Kafka
>> Indexing Service, it actually results in unwanted behavior.
>>
>>
>> Note: I tried cross-posting to the Apache mailing list. However, the Druid
>> page on the Apache website (http://incubator.apache.org/projects/druid.html)
>> has no instructions on how to actually view or subscribe to the list.
>>
>>
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "Druid Development" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to druid-development+unsubscribe@xxxxxxxxxxxxxxxx.
>>
>> To post to this group, send email to druid-development@xxxxxxxxxxxxxxxx.
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/druid-development/0e680eed-d1b6-4277-8e0f-51cfa78c3ed7%40googlegroups.com
>> <https://groups.google.com/d/msgid/druid-development/0e680eed-d1b6-4277-8e0f-51cfa78c3ed7%40googlegroups.com?utm_medium=email&utm_source=footer>
>> .
>> For more options, visit https://groups.google.com/d/optout.
>>
> --
> Prashant
>
>
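
The splitting described in the original post can be reproduced with a small
simulation. This is a simplified model under stated assumptions, not Druid
code: it assumes tasks start back to back from the moment the supervisor spec
is submitted, that each task publishes only when its 'taskDuration' ends (no
incremental handoff), and that event timestamps track wall-clock time. The
ranges it prints are the spans of data each published piece covers, in the
notation the post uses. The names simulate_pieces and label are made up for
this illustration.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def simulate_pieces(submitted, task_duration, granularity, n_tasks):
    """Return (start, end) spans of data, cut at every task boundary
    and at every segment-granularity boundary."""
    pieces = []
    task_start = submitted
    for _ in range(n_tasks):
        task_end = task_start + task_duration
        cursor = task_start
        while cursor < task_end:
            # Floor the cursor to the granularity bucket that contains it.
            bucket_start = EPOCH + ((cursor - EPOCH) // granularity) * granularity
            bucket_end = bucket_start + granularity
            # A piece ends at the bucket boundary or when the task ends,
            # whichever comes first.
            piece_end = min(bucket_end, task_end)
            pieces.append((cursor, piece_end))
            cursor = piece_end
        task_start = task_end
    return pieces


def label(pieces):
    """Format spans as HH:MM-HH:MM strings."""
    return [f"{start:%H:%M}-{end:%H:%M}" for start, end in pieces]


hour = timedelta(hours=1)

# Spec submitted at 01:15 with taskDuration PT1H and hourly granularity.
print(label(simulate_pieces(datetime(2018, 4, 5, 1, 15), hour, hour, 2)))
# ['01:15-02:00', '02:00-02:15', '02:15-03:00', '03:00-03:15']

# Same submission time with taskDuration set to 3x the granularity.
print(label(simulate_pieces(datetime(2018, 4, 5, 1, 15), 3 * hour, hour, 2)))
# ['01:15-02:00', '02:00-03:00', '03:00-04:00', '04:00-04:15',
#  '04:15-05:00', '05:00-06:00', '06:00-07:00', '07:00-07:15']
```

In this model the 3x setting leaves two of every three hours whole but still
splits the hour in which a task boundary falls (04:00-04:15 and 04:15-05:00
here), which matches the objection that a longer 'taskDuration' delays the
problem rather than removing it. Only a submission time aligned to the hour
yields clean hourly pieces.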

