git.net

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Support for graph processing


Thanks for the reply. We are currently looking into the batch pipeline. Your suggested approaches are also the direction that we are investigating, next to using Spark or Flink directly.

Why would you say that it is better to materialize the intermediate state? Is this more/less efficient? Suppose I output my "finished" processing to one collection and I output my "unfinished" data to another PCollection. Will that still be less efficient?

Looking at the data, currently, I would need 6 iterations with an average of 4. However, that of course depends on the data which can evolve over time. So the current, custom, implementation has a hard limit of 15 that I would take over. How does this affect the performance if I have 9 times a NOOP?


Op di 29 mei 2018 om 17:09 schreef Lukasz Cwik <lcwik@xxxxxxxxxx>:
This has been a long requested feature within Apache Beam: https://issues.apache.org/jira/browse/BEAM-106

The short story is that this requires a lot of support from execution engines since watermarks become a concept of time + loop iteration (possibly multiple loop iterations if they are nested).

Some solutions right now:
* for a batch pipeline, unroll your own loops a number of times and place filters within each loop iteration to prevent downstream processing from occurring if there is nothing to do. Better yet would be to have your pipeline materialize its intermediate state after some number of loop iterations and then have the driver program relaunch the pipeline consuming the intermediate state if needed.
* for streaming pipeline, create a feedback loop via a source/sink pair like Kafka/Pubsub where the intermediate computation is output into the sink and then read by the source.


On Mon, May 28, 2018 at 11:26 PM Jan Callewaert <jan.callewaert@xxxxxxxxx> wrote:
Hello,

I am investigating technologies for bulk processing. One of my required use cases is large-scale graph processing. This is supported by Apache Spark as GraphX, or by Apache Flink as iterative algorithms. However, this does not seem to be supported by Apache Beam. Are there any future plans for this, or is there an alternative approach for graph processing offered by Apache Beam?

Regards,

Jan