git.net

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Understanding GenerateSequence and SideInputs


Many thanks for your comments. Really appreciate it!!

I'm not sure I understood the GenerateSequence guarantees explanation. By "it will generate at most a given number..." you mean that it won't generate more than the given number per period, right? If that was the case maybe we need to review it closely as that's exactly what I've seen (I configured 1 element every 5 minutes and it was consistently emitting every 5 minutes plus a few millis, except for a time that it emitted after 4 minutes and some seconds.

About the Square Enix link, many thanks Reza, that's something we are following for our implementation.

Thanks everyone again!

On Fri, May 25, 2018 at 1:34 AM Reza Rokni <rez@xxxxxxxxxx> wrote:
Hi,

Not sure if this is useful for your use case but as you are using BQ with a changing schema the following may also be a interesting read ...


Cheers

Reza



On Fri, May 25, 2018, 5:50 AM Raghu Angadi <rangadi@xxxxxxxxxx> wrote:

On Thu, May 24, 2018 at 1:11 PM Carlos Alonso <carlos@xxxxxxxxxxxxx> wrote:
Hi everyone!!

I'm building a pipeline to store streaming data into BQ and I'm using the pattern: Slowly changing lookup cache described here: https://cloud.google.com/blog/big-data/2017/06/guide-to-common-cloud-dataflow-use-case-patterns-part-1 to hold and refresh the table schemas (as they may change from time to time).

Now I'd like to understand how that is scheduled on a distributed system. Who is running that code? One random node? One node but always the same? All nodes?

GenerateSequence() is uses an unbounded source. Like any unbounded source, it can has a set of 'splits' ('desiredNumSplits' [1] is set by runtime). Each of the splits run in parallel.. a typical runtime distributes these across the workers. Typically they stay on a worker unless there is a reason to redistribute (autoscaling, workers unresponsive etc). W.r.t. user application there are no guarantees about affinity. 

 

Also, what are the GenerateSequence guarantees in terms of precision? I have it configured to generate 1 element every 5 minutes and most of the time it works exact, but sometimes it doesn't... Is that expected?

Each of the splits mentioned above essentially runs 'advance() [2]' in a loop. It check current walltime to decide if it needs to emit next element. If the value you see off by a few seconds, it would imply 'advance()' was not called during that time by the framework. Runtime frameworks usually don't provide any hard or soft deadlines for scheduling work.



Regards