
Re: Support for graph processing

I should have been clearer in my original response. Materializing is less efficient but the materialization is helpful for two reasons:
1) You can checkpoint intermediary computations and not lose all the processing if something goes awry.
2) Some Runners have issues with large graphs. For example, Dataflow has a job graph submission size limit which typically prevents graphs with thousands of PTransforms (a limit that is easy to hit when loop unrolling).

If you don't think the intermediary materializations are worthwhile and the Runner can handle arbitrarily large graphs, then zero materializations should perform the best.
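To make the trade-off concrete, here is a plain-Python sketch (no Beam dependencies) of the driver-side control flow: run a bounded number of pipeline iterations, materializing the unfinished records between runs so a failure only loses one iteration's work. `step` and `done` are hypothetical stand-ins for the real per-element transform and the finished/unfinished predicate.

```python
def step(x):
    """Toy transform standing in for one loop iteration's processing."""
    return x // 2

def done(x):
    """Toy finished-predicate standing in for the real completion check."""
    return x <= 1

def run_with_checkpoints(records, max_iterations=15):
    """Run up to max_iterations passes, splitting finished from unfinished.

    In a real pipeline each `checkpoint` would be written to durable
    storage and read back by a relaunched pipeline, so a crash only
    loses the current iteration.
    """
    finished = []
    unfinished = list(records)
    for _ in range(max_iterations):
        if not unfinished:  # surplus iterations become cheap no-ops
            break
        checkpoint = [step(x) for x in unfinished]  # would be a write/read
        finished.extend(x for x in checkpoint if done(x))
        unfinished = [x for x in checkpoint if not done(x)]
    return finished, unfinished

print(run_with_checkpoints([8, 3, 1]))
```

The same split is where the "finished" / "unfinished" PCollection idea fits: only the unfinished output needs to flow into the next iteration.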

Also, having "finished" and "unfinished" PCollections makes total sense to me.

Most runners have very minimal overhead for handling transforms with zero data (it's just a small amount of record keeping).

On Wed, May 30, 2018 at 2:02 AM Jan Callewaert <jan.callewaert@xxxxxxxxx> wrote:
Thanks for the reply. We are currently looking into the batch pipeline. Your suggested approaches are also the direction that we are investigating, next to using Spark or Flink directly.

Why would you say that it is better to materialize the intermediate state? Is this more/less efficient? Suppose I output my "finished" processing to one collection and I output my "unfinished" data to another PCollection. Will that still be less efficient?

Looking at the current data, I would need at most 6 iterations, with an average of 4. However, that of course depends on the data, which can evolve over time. The current custom implementation has a hard limit of 15 iterations, which I would carry over. How does this affect performance if 9 of the iterations are effectively no-ops?

Op di 29 mei 2018 om 17:09 schreef Lukasz Cwik <lcwik@xxxxxxxxxx>:
This has been a long-requested feature within Apache Beam.

The short story is that this requires a lot of support from execution engines since watermarks become a concept of time + loop iteration (possibly multiple loop iterations if they are nested).

Some solutions right now:
* for a batch pipeline, unroll your own loops a number of times and place filters within each loop iteration to prevent downstream processing from occurring if there is nothing to do. Better yet would be to have your pipeline materialize its intermediate state after some number of loop iterations and then have the driver program relaunch the pipeline consuming the intermediate state if needed.
* for a streaming pipeline, create a feedback loop via a source/sink pair like Kafka/Pubsub, where the intermediate computation is output into the sink and then read back by the source.
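The batch approach above can be sketched in plain Python (no Beam dependencies): the pipeline graph contains a fixed number of copies of the transform, and a filter in each copy lets already-finished elements pass through untouched, so any surplus iterations are effectively no-ops. `step` and `done` are hypothetical stand-ins for the real per-element logic.

```python
MAX_ITERATIONS = 15  # fixed unroll count, baked into the pipeline graph

def step(x):
    """Toy transform standing in for one iteration's processing."""
    return x // 2

def done(x):
    """Toy finished-predicate; finished elements skip further processing."""
    return x <= 1

def unrolled_pipeline(elements):
    """Apply MAX_ITERATIONS unrolled copies of the filtered transform.

    Elements needing fewer iterations just flow through the remaining
    copies unchanged, which is the no-op cost being discussed above.
    """
    pcoll = list(elements)
    for _ in range(MAX_ITERATIONS):
        pcoll = [x if done(x) else step(x) for x in pcoll]
    return pcoll

print(unrolled_pipeline([100, 7, 1]))
```

Here 100 needs 6 halvings, 7 needs 2, and 1 needs none, so the remaining unrolled copies do nothing for those elements, which is cheap on most runners.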

On Mon, May 28, 2018 at 11:26 PM Jan Callewaert <jan.callewaert@xxxxxxxxx> wrote:

I am investigating technologies for bulk processing. One of my required use cases is large-scale graph processing. This is supported by Apache Spark via GraphX, and by Apache Flink via iterative algorithms. However, it does not seem to be supported by Apache Beam. Are there any future plans for this, or is there an alternative approach to graph processing offered by Apache Beam?