Re: cassandra-stress HexStrings generator


Yes, I’m pretty sure you understood correctly (I wrote most of this, but it’s been a long time so I cannot remember much for certain).  

It should be implemented like the Strings generator.  It looks like both HexStrings and HexBytes are incorrect, and have been for a long time.
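
To illustrate the difference discussed in this thread, here is a minimal Python sketch. It is not the cassandra-stress code: the function names (draw_identity, per_value_hex, per_char_hex) and the clamped-gaussian approximation of gaussian(1..100) are assumptions made for illustration only. It contrasts drawing one identity per value, with the whole string derived deterministically from that draw, against drawing a fresh identity for every character, which is the behavior described in the quoted message below.

```python
import random
from collections import Counter

HEX = "0123456789abcdef"


def draw_identity(rng, lo=1, hi=100):
    """Draw an integer from a bell-shaped distribution restricted to [lo, hi].

    Assumption: a rough stand-in for a population of gaussian(1..100).
    """
    mean = (lo + hi) / 2
    stdev = (hi - lo) / 6
    while True:
        v = round(rng.gauss(mean, stdev))
        if lo <= v <= hi:
            return v


def per_value_hex(rng, size=32):
    """One identity draw per value; the string is a pure function of it."""
    seed = draw_identity(rng)
    value_rng = random.Random(seed)  # same seed always yields the same string
    return "".join(value_rng.choice(HEX) for _ in range(size))


def per_char_hex(rng, size=32):
    """One identity draw per character; whole strings almost never repeat."""
    return "".join(HEX[draw_identity(rng) % 16] for _ in range(size))


rng = random.Random(42)
n = 10_000
per_value = Counter(per_value_hex(rng) for _ in range(n))
per_char = Counter(per_char_hex(rng) for _ in range(n))

print("distinct values, one draw per value:", len(per_value))  # at most 100
print("distinct values, one draw per char: ", len(per_char))
```

With one draw per value, the number of distinct strings is bounded by the size of the population range and their frequencies follow the population distribution. With one draw per character, only the characters follow that distribution, and the number of distinct strings grows with the amount of data generated.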


> On 12 Dec 2018, at 22:27, Saleil Bhat (BLOOMBERG/ 731 LEX) <sbhat39@xxxxxxxxxxxxx> wrote:
> 
> Hi, 
> 
> I have a question about the behavior of the HexStrings value generator in the cassandra-stress tool, particularly concerning its population/identity distribution.  
> 
> 
> Per the discussion in JIRA item CASSANDRA-6146 concerning the stress YAML profile, the population field in a columnspec “represents the total unique population distribution of that column across rows.”
> 
> 
> I interpreted this to mean that if I specify some distribution 'F' for a column, then the probability of occurrence for each potential value of that column is given by 'F'. 
> 
> So, for example, if I provided the following columnspec for a text column: 
>     name: fake_column
>     size: fixed(32)
>     population: gaussian(1..100)
> and then generated a large amount of data according to this specification, 
> I would expect there to be 100 distinct values for ‘fake_column’, and that a histogram of the frequency of occurrence of each value would be roughly bell-shaped. 
> 
> 
> 
> However, the current implementation of the HexStrings generator deviates from this expectation. In the current implementation, each CHARACTER in the string is drawn from F, rather than the string as a whole. Therefore, if you plot the histogram of frequency of occurrence for each character, you get a bell-shaped curve, but the distribution of the occurrences of whole strings (the actual columns) is something else. 
> 
> 
> My question is, is this the desired behavior for string columns? Was my expectation/interpretation incorrect? If so, can anyone give some insight as to why strings are designed to behave this way and what the use case is for this behavior? 
> 
> Thanks, 
> -Saleil 

