Re: Implementing a “join” between a DataStream and a “set of rules”


Hi Amit,

In my current approach, the idea for updating the rule set was to have either a separate "control" stream that triggers an update to a local data structure, or a "control" event within the main data stream that triggers the same update.
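
To make the control-event idea concrete, here is a minimal sketch in plain Java with no Flink dependency. The names RuleMatcher, upsertRule, removeRule and match are hypothetical, events are plain Strings, and a rule is just a Predicate. In a real job the control calls and the data calls would presumably come from the two inputs of a connected stream, or from a type check on a single stream.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

/** Hypothetical stand-in for the state kept inside one flatMap operator instance. */
final class RuleMatcher {
    // Rule id -> predicate over the event. A plain map is used on the assumption
    // that a single thread drives both the control path and the data path.
    private final Map<String, Predicate<String>> rules = new LinkedHashMap<>();

    /** Control event: add a new rule or replace an existing one. */
    void upsertRule(String ruleId, Predicate<String> rule) {
        rules.put(ruleId, rule);
    }

    /** Control event: drop a rule that is no longer valid. */
    void removeRule(String ruleId) {
        rules.remove(ruleId);
    }

    /** Data event: check the event against every rule, return the matching rule ids. */
    List<String> match(String event) {
        List<String> matched = new ArrayList<>();
        for (Map.Entry<String, Predicate<String>> entry : rules.entrySet()) {
            if (entry.getValue().test(event)) {
                matched.add(entry.getKey());
            }
        }
        return matched;
    }
}
```

With this shape a rule update touches a single map entry, so the full rule set never has to be reloaded. What the sketch does not cover is how the rules survive a restart, which is where the managed state question comes in.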

Using an external system such as a cache or database is also an option, but that would still require some kind of trigger to reload the rule set, or a single rule, whenever it is updated.

Others have suggested using Flink managed state, but I'm still not sure whether that is a generally recommended approach for this scenario, since it seems to be meant more for windowing-type processing.

Thanks,
Turar

On 6/5/18, 8:46 AM, "Amit Jain" <aj2011it@xxxxxxxxx> wrote:

    Hi Sandybayev,
    
    In its current state, Flink does not provide a solution for the
    use case you describe. However, there is an open FLIP [1] [2] that was
    created to address it.
    
    I can see that in your current approach you are not able to update the
    rule set data. I think you can update the rule set data by building a
    DataStream around changelogs stored in a message queue or a
    distributed file system.
    Alternatively, you can store the rule set data in an external system
    and query it from Flink for incoming keys.
    
    [1]: https://cwiki.apache.org/confluence/display/FLINK/FLIP-17+Side+Inputs+for+DataStream+API
    [2]: https://issues.apache.org/jira/browse/FLINK-6131
    
    On Tue, Jun 5, 2018 at 5:52 PM, Sandybayev, Turar (CAI - Atlanta)
    <Turar.Sandybayev@xxxxxxxxxxxxxx> wrote:
    > Hi,
    >
    > What is the best-practice recommendation for the following use case? We need
    > to match a stream against a set of “rules”, which is essentially a Flink
    > DataSet concept. Updates to this “rule set” are possible but infrequent.
    > Each stream event must be checked against all the records in the “rule set”,
    > and each match produces one or more events into a sink. The number of records
    > in a rule set is in the six-digit range.
    >
    > Currently we simply load the rules into a local List and use flatMap over
    > the incoming DataStream. Inside flatMap, we iterate over the list, comparing
    > each event to each rule.
    >
    > To speed up the iteration, we can also split the list into several batches,
    > essentially creating a list of lists, and creating a separate thread to
    > iterate over each sub-list (using Futures in either Java or Scala).
    >
    > Questions:
    >
    > 1. Is there a better way to do this kind of join?
    >
    > 2. If not, is it safe to add additional parallelism by creating new threads
    > inside each flatMap operation, on top of what Flink is already doing?
    >
    > Thanks in advance!
    >
    > Turar
    >
    >
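
For reference, the list-of-lists idea from the original question (split the rules into batches and scan each batch in its own Future) could be sketched roughly as follows in plain Java. This is only an illustration under stated assumptions: BatchedMatcher and its methods are hypothetical names, the match test is passed in as a BiPredicate, and batchCount is assumed to be at least 1. It does not answer whether the extra threads are advisable inside a Flink operator.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.BiPredicate;

/** Hypothetical helper: scans a rule list in parallel batches for each event. */
final class BatchedMatcher<E, R> implements AutoCloseable {
    private final List<List<R>> batches = new ArrayList<>();
    private final BiPredicate<E, R> matches;
    private final ExecutorService pool;

    /** batchCount must be >= 1 (assumption of this sketch, not checked here). */
    BatchedMatcher(List<R> rules, int batchCount, BiPredicate<E, R> matches) {
        // Ceiling division so that at most batchCount sub-lists are created.
        int size = Math.max(1, (rules.size() + batchCount - 1) / batchCount);
        for (int i = 0; i < rules.size(); i += size) {
            batches.add(new ArrayList<>(rules.subList(i, Math.min(i + size, rules.size()))));
        }
        this.matches = matches;
        this.pool = Executors.newFixedThreadPool(batchCount);
    }

    /** Scans every batch concurrently and blocks until all batches are done. */
    List<R> match(E event) {
        List<CompletableFuture<List<R>>> futures = new ArrayList<>();
        for (List<R> batch : batches) {
            futures.add(CompletableFuture.supplyAsync(() -> {
                List<R> hits = new ArrayList<>();
                for (R rule : batch) {
                    if (matches.test(event, rule)) {
                        hits.add(rule);
                    }
                }
                return hits;
            }, pool));
        }
        // Join in batch order, so results keep the original rule order.
        List<R> all = new ArrayList<>();
        for (CompletableFuture<List<R>> future : futures) {
            all.addAll(future.join());
        }
        return all;
    }

    @Override
    public void close() {
        pool.shutdown();
    }
}
```

Because match() waits for every batch before returning, the worker threads only read the rules, and any output would still be emitted from the calling thread. The pool is created once and reused across events; inside a rich function it would presumably be created in open() and shut down in close().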