Re: Implementing a “join” between a DataStream and a “set of rules”
In my current approach, the idea for updating the rule set was to have either a separate "control" stream that triggers an update to a local data structure, or a "control" event within the main data stream that does the same.
Using an external system such as a cache or a database is also an option, but that would still require some kind of trigger to reload the rule set, or a single rule, whenever it is updated.
Others have suggested using Flink managed state, but I'm still not sure whether that is a generally recommended approach in this scenario, since it seems to have been meant more for windowing-type processing?
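To make the "control" idea concrete, here is a minimal sketch in plain Java, with no Flink dependency. The names RuleMatcher, RuleUpdate, onControl and onEvent are hypothetical and exist only for illustration, and events are simplified to Strings. As far as I understand, in a Flink job the two methods would roughly correspond to the two inputs of a connected stream (for example flatMap1 and flatMap2 of a CoFlatMapFunction), and with parallelism above one the control messages would have to reach every parallel instance, but that mapping is my assumption and not something confirmed in this thread.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

/** Hypothetical sketch: a matcher whose rule set is changed only by control messages. */
class RuleMatcher {
    /** Hypothetical control message: a non-null predicate adds or replaces a rule, null deletes it. */
    static final class RuleUpdate {
        final String ruleId;
        final Predicate<String> predicate;

        RuleUpdate(String ruleId, Predicate<String> predicate) {
            this.ruleId = ruleId;
            this.predicate = predicate;
        }
    }

    // Local rule set, keyed by rule id so that a single rule can be replaced or removed.
    private final Map<String, Predicate<String>> rules = new LinkedHashMap<>();

    /** Control path: apply one update to the local rule set. */
    void onControl(RuleUpdate update) {
        if (update.predicate == null) {
            rules.remove(update.ruleId);
        } else {
            rules.put(update.ruleId, update.predicate);
        }
    }

    /** Data path: return the ids of all rules that the event matches. */
    List<String> onEvent(String event) {
        List<String> matched = new ArrayList<>();
        for (Map.Entry<String, Predicate<String>> entry : rules.entrySet()) {
            if (entry.getValue().test(event)) {
                matched.add(entry.getKey());
            }
        }
        return matched;
    }
}
```

Keying the rules by id is what allows a single rule to be reloaded without rebuilding the whole set, which is the case I care about most.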
On 6/5/18, 8:46 AM, "Amit Jain" <aj2011it@xxxxxxxxx> wrote:
In its current state, Flink does not provide a solution to the
mentioned use case. However, there is an open FLIP which has been
created to address it.
I can see that, in your current approach, you are not able to update
the rule set data. I think you can update it by building a
DataStream around changelogs that are stored in a message
queue or distributed file system.
You can also store the rule set data in an external system and query
it from Flink for incoming keys.
On Tue, Jun 5, 2018 at 5:52 PM, Sandybayev, Turar (CAI - Atlanta)
> What is the best-practice recommendation for the following use case? We need
> to match a stream against a set of “rules”, which is essentially a Flink
> DataSet concept. Updates to this “rule set” are possible but not frequent.
> Each stream event must be checked against all the records in the “rule set”,
> and each match produces one or more events into a sink. The number of records
> in a rule set is in the six-digit range.
> Currently we're simply loading rules into a local List of rules and using
> flatMap over an incoming DataStream. Inside flatMap, we're just iterating
> over the list, comparing each event to each rule.
> To speed up the iteration, we can also split the list into several batches,
> essentially creating a list of lists, and creating a separate thread to
> iterate over each sub-list (using Futures in either Java or Scala).
> 1. Is there a better way to do this kind of a join?
> 2. If not, is it safe to add additional parallelism by creating
> new threads inside each flatMap operation, on top of what Flink is already
> Thanks in advance!
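For reference, the "list of lists" idea from the original question can be sketched in plain Java as follows. BatchedMatcher and countMatches are hypothetical names, events are simplified to Strings, and the sketch only shows the mechanics of fanning one event out over sub-lists with Futures. It does not answer whether doing this inside a Flink operator is advisable, which is the open question above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Predicate;

/** Hypothetical sketch: check one event against sub-lists of rules in parallel. */
class BatchedMatcher implements AutoCloseable {
    private final List<List<Predicate<String>>> batches = new ArrayList<>();
    private final ExecutorService pool;

    BatchedMatcher(List<Predicate<String>> rules, int batchSize, int threads) {
        // Split the rule list into consecutive sub-lists of at most batchSize rules.
        for (int i = 0; i < rules.size(); i += batchSize) {
            int end = Math.min(i + batchSize, rules.size());
            batches.add(new ArrayList<>(rules.subList(i, end)));
        }
        // The pool is created once and reused for every event.
        this.pool = Executors.newFixedThreadPool(threads);
    }

    /** Returns how many rules match the event; blocks until every batch has finished. */
    int countMatches(String event) throws InterruptedException, ExecutionException {
        List<Callable<Integer>> tasks = new ArrayList<>();
        for (List<Predicate<String>> batch : batches) {
            tasks.add(() -> {
                int matches = 0;
                for (Predicate<String> rule : batch) {
                    if (rule.test(event)) {
                        matches++;
                    }
                }
                return matches;
            });
        }
        int total = 0;
        // invokeAll waits for all tasks, so the caller stays blocked for the whole event.
        for (Future<Integer> result : pool.invokeAll(tasks)) {
            total += result.get();
        }
        return total;
    }

    @Override
    public void close() {
        pool.shutdown();
    }
}
```

Note that the calling thread still waits for the slowest sub-list, so the gain per event is bounded by the number of spare cores, and the pool has to be shut down when the operator is closed.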