Re: Performance issue in Beam 2.4 onwards


Hi,

Do you use specific or complex coders in your pipeline?

I'm sure Eugene can offer some insight into this change. AFAIR, the
purpose is to use coders more cleanly and to identify identity copies.

Regards
JB

On 09/07/2018 16:22, Vojtech Janota wrote:
> Hi,
> 
> We have been using Apache Beam in our project for some time now. Since our
> datasets are of modest size, we have so far used the DirectRunner, as the
> computation easily fits onto a single machine. Recently we upgraded Beam
> from 2.2 to 2.4 and found that the performance of our pipelines
> deteriorated drastically. Pipelines that took ~3 minutes with 2.2 now do
> not finish within hours. We tried to isolate the change that causes the
> slowdown and traced it to these commits to the "InMemoryStateInternals" class:
> 
> * https://github.com/apache/beam/commit/32a427c
> * https://github.com/apache/beam/commit/8151d82
> 
> In a nutshell, where previously the copy() method simply assigned:
> 
>   that.value = this.value
> 
> there is now a coder encode/decode round trip hidden behind:
> 
>   that.value = uncheckedClone(coder, this.value)
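
To make the difference between the two copy() variants quoted above concrete,
here is a minimal sketch in plain Java. It does not use Beam: SimpleCoder,
INT_LIST_CODER, copyByReference and cloneViaCoder are hypothetical names made
up for illustration, and the real uncheckedClone may differ in its details.
The sketch only assumes what the message says, namely that the clone is an
encode followed by a decode. It shows that the assignment shares one object at
constant cost, while the round trip produces an independent copy at a cost
proportional to the encoded size of the value.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class CoderCloneSketch {

  /** Hypothetical stand-in for a coder; NOT the real Beam Coder API. */
  interface SimpleCoder<T> {
    void encode(T value, DataOutputStream out) throws IOException;

    T decode(DataInputStream in) throws IOException;
  }

  /** Illustrative coder: a length prefix followed by each int element. */
  static final SimpleCoder<List<Integer>> INT_LIST_CODER =
      new SimpleCoder<List<Integer>>() {
        @Override
        public void encode(List<Integer> value, DataOutputStream out) throws IOException {
          out.writeInt(value.size());
          for (int v : value) {
            out.writeInt(v);
          }
        }

        @Override
        public List<Integer> decode(DataInputStream in) throws IOException {
          int n = in.readInt();
          List<Integer> result = new ArrayList<>(n);
          for (int i = 0; i < n; i++) {
            result.add(in.readInt());
          }
          return result;
        }
      };

  /** Mirrors "that.value = this.value": both sides share one object. */
  static <T> T copyByReference(T value) {
    return value;
  }

  /** Mirrors the encode/decode round trip: work grows with the encoded size. */
  static <T> T cloneViaCoder(SimpleCoder<T> coder, T value) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      DataOutputStream out = new DataOutputStream(bytes);
      coder.encode(value, out);
      out.flush();
      return coder.decode(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    List<Integer> original = new ArrayList<>(Arrays.asList(1, 2, 3));
    List<Integer> shared = copyByReference(original);
    List<Integer> cloned = cloneViaCoder(INT_LIST_CODER, original);

    // Mutating the original after copying shows which copy is independent.
    original.add(4);
    System.out.println("shared sees the mutation: " + (shared.size() == 4));
    System.out.println("clone is unaffected: " + (cloned.size() == 3));
  }
}
```

If copy() is called often on large state values, the round trip in this
sketch repeats the full encode and decode work on every call, which is
consistent with the kind of slowdown reported, although the sketch itself
does not measure anything in Beam.
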
> 
> Can somebody explain the purpose of this change? Is it meant as an
> additional "enforcement" point, similar to DirectRunner's
> enforceImmutability and enforceEncodability? Or is it something that is
> genuinely needed to provide correct behaviour of the pipeline?
> 
> Any hints or thoughts are appreciated.
> 
> Regards,
> Vojta

-- 
Jean-Baptiste Onofré
jbonofre@xxxxxxxxxx
http://blog.nanthrax.net
Talend - http://www.talend.com