Re: Understanding GenerateSequence and SideInputs


The runner is responsible for scheduling the work anywhere it chooses. It can be the same node all the time or different nodes.

There is no guarantee that elements are produced exactly on time, only that they are not produced faster than the configured rate: the withRate method states that it will "generate at most a given number of elements per a given period". This is because a DoFn can't control whether and when the runner decides to schedule the work. A runner will attempt to honor any processing commitments that it knows about, such as timers, but if it has too much work and too few resources, it may fall behind or decide to group small work units into larger ones for performance reasons.
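To make the "at most" wording concrete, here is a toy model in plain Python. It is not Beam code and does not reflect how any runner is implemented; the function name, its parameters, and the random delay are all assumptions made purely for illustration. It assumes element i becomes eligible at i * period and that the runner then adds some non-negative scheduling delay, so elements can be late but never early.

```python
import random


def simulate_emissions(num_elements, period_s, max_delay_s, seed=0):
    """Toy model (hypothetical, not Beam): return emission times in seconds.

    Element i becomes eligible at i * period_s. The imaginary runner then
    adds a scheduling delay drawn uniformly from [0, max_delay_s], so an
    element may be emitted late but never before it is eligible.
    """
    rng = random.Random(seed)
    emitted = []
    for i in range(num_elements):
        eligible_at = i * period_s
        delay = rng.uniform(0, max_delay_s)  # stand-in for runner backlog
        emitted.append(eligible_at + delay)
    return emitted


if __name__ == "__main__":
    # One element every 5 minutes, with up to 2 minutes of scheduling delay.
    for i, t in enumerate(simulate_emissions(5, 300.0, 120.0)):
        print(f"element {i}: eligible at {i * 300.0:.0f}s, emitted at {t:.1f}s")
```

With max_delay_s set to 0 the emissions land exactly on the period boundaries; any positive delay reproduces the "most of the time it is exact, but sometimes it isn't" behaviour described in the question.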



On Thu, May 24, 2018 at 1:11 PM Carlos Alonso <carlos@xxxxxxxxxxxxx> wrote:
Hi everyone!!

I'm building a pipeline to store streaming data into BQ and I'm using the "Slowly changing lookup cache" pattern described here: https://cloud.google.com/blog/big-data/2017/06/guide-to-common-cloud-dataflow-use-case-patterns-part-1 to hold and refresh the table schemas (as they may change from time to time).

Now I'd like to understand how that is scheduled on a distributed system. Who is running that code? One random node? One node but always the same? All nodes?

Also, what are the GenerateSequence guarantees in terms of precision? I have it configured to generate 1 element every 5 minutes, and most of the time it is exact, but sometimes it isn't... Is that expected?

Regards