git.net

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Routing events by key


Note that even if you use GroupByKey and a 1 second window, it could be that key K at time T1 and T2 are scheduled to be processed in parallel which means that you will still need locking.

Apache Beam has no transform which allows you to partition the data how you want without using synchronization/locking/... unless your underlying storage engine allows you to pass in user specified version numbers which then you could use the windowing information to produce larger and larger version numbers so the storage engine would know which write it should keep and which write it should discard.

Alternatively, if you know which runner you want to use, it may be that intrinsically via some execution properties of the runner you ca get what you need but you'll have a pipeline which isn't following strict Apache Beam semantics and if the runner was to change, it may break you.

Finally, if none of that works out, you'll want to use a stream processing engine that allows you to specifically say that any key range should only ever be processed on a single machine at a time. This can have lots of its own problems if you hit a hot key since one machine will be swamped processing while the others are relatively idle.


On Fri, Jul 6, 2018 at 8:13 AM Jean-Baptiste Onofré <jb@xxxxxxxxxxxx> wrote:
Hi Niels,

as you have an Unbounded PCollection, you need a Window to GroupByKey
and then "forward" the data.

Another option would be to use a DoFn working element per element and
eventually batching then. It's what most of the IOs are doing for the
Write part.

Regards
JB

On 06/07/2018 17:01, Niels Basjes wrote:
> Hi,
>
> I have an unbounded stream of change events each of which has the id of
> the entity that is changed.
> To avoid the need for locking in the persistence layer that is needed in
> part of my processing I want to route all events based on this entity id.
> That way I know for sure that all events around a single entity go
> through the same instance of my processing sequentially, hence no need
> for locking or other synchronization regarding this persistence.
>
> At this point my best guess is that I need to use the GroupByKey but
> that seems to need a Window. 
> I think I don't want a window because the stream is unbounded and I want
> the lowest possible latency (i.e. a Window of 1 second would be ok for
> this usecase).
> Also I want to be 100% sure that all events for a specific id go to only
> a single instance because I do not want any race conditions.
>
> My simple question is: What does that look like in the Beam Java API?
>
> --
> Best regards / Met vriendelijke groeten,
>
> Niels Basjes

--
Jean-Baptiste Onofré
jbonofre@xxxxxxxxxx
http://blog.nanthrax.net
Talend - http://www.talend.com