Re: Multimap PCollectionViews' values updated rather than appended


Yes, this is a known issue. Here's a prior discussion: https://lists.apache.org/thread.html/e9518f5d5f4bcf7bab02de2cb9fe1bd5293d87aa12d46de1eac4600b@%3Cuser.beam.apache.org%3E

The issue is long-standing, and the solution is known but hard.



On Wed, May 30, 2018 at 9:48 AM Carlos Alonso <carlos@xxxxxxxxxxxxx> wrote:
Hi everyone!!

Working with multimap-based side inputs in the global window, I'm seeing something unexpected (at least to me) that I'd like to share with you to clarify.

The way I understand multimaps, when one emits two values for the same key in the same window (always the case here, as I'm working in the global window), the newly emitted values are appended to the Iterable that is the value for that key in the map.
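
The difference between appending and replacing can be sketched with plain Python dictionaries. This is only an illustration of the two semantics, not Beam or scio code: the function names, the table key, and the field counts are hypothetical.

```python
from collections import defaultdict


def append_semantics(emissions):
    """Keep every value emitted for a key (the multimap behaviour expected above)."""
    view = defaultdict(list)
    for key, value in emissions:
        view[key].append(value)
    return dict(view)


def replace_semantics(emissions):
    """Let a later emission for a key discard the earlier ones."""
    view = {}
    for key, value in emissions:
        view[key] = [value]
    return view


# Hypothetical emissions: the same table reference seen first with 2 fields,
# then with 3 fields after the table is patched.
emissions = [("project:dataset.table", 2), ("project:dataset.table", 3)]

expected = append_semantics(emissions)
observed = replace_semantics(emissions)
print(expected)
print(observed)
```

With append semantics the iterable for the key holds both values; with replace semantics it holds only the latest one.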

I'm testing this with the following job (it uses scio, but side inputs are implemented with PCollectionViews): https://github.com/calonso/beam_experiments/blob/master/refreshingsideinput/src/main/scala/com/mrcalonso/RefreshingSideInput2.scala

The steps to reproduce are:
1. Create one table on the target BQ
2. Run the job
3. Patch the table on BQ (add one field); this should generate a new TableSchema for the corresponding TableReference
4. An updated field count appears in the logs, but there is only one element in the iterable, as if the value had been updated instead of appended!

Is that the expected behaviour? Is it a bug? Am I missing something?

Thanks!