Fast sampling support in dqrng

About half a year ago I wrote about dqsample, which provides a bias-free alternative to base::sample(). This is now mostly of historic interest, since the upcoming R release 3.6.0 will ship with an updated sampling algorithm. However, some of the techniques used in dqsample are now part of dqrng. Of course, the sampling method is still free of the previously described bias.

This most recent version of dqrng is not on CRAN yet, but you can install it via drat:
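Installation via drat might look like this (a sketch; the repository name used here is an assumption, so check the package README for the authoritative instructions):

```r
# install drat if necessary, then add the repository and install dqrng
if (!requireNamespace("drat", quietly = TRUE)) install.packages("drat")
drat::addRepo("daqana")  # assumed repository name
install.packages("dqrng")
```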

Overall we will see that it is possible to improve on R’s sampling speed without compromising statistical qualities. Feedback is always welcome!

Benchmarks

The following benchmarks were made with R version 3.5.3 (2019-03-11) running under Debian GNU/Linux 9 (stretch) on an Intel® Core™ i7-6600U CPU @ 2.60GHz. YMMV, as always with benchmarks …
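The timing tables below appear to follow the output format of the bench package; a minimal sketch of such a benchmark (the values of m and n are placeholders, not the ones used for the published numbers):

```r
library(bench)
library(dqrng)

m <- 1e6  # placeholder population size
n <- 1e4  # placeholder sample size
mark(
  sample.int(m, n, replace = TRUE),
  dqsample.int(m, n, replace = TRUE),
  check = FALSE  # different RNGs produce different samples by design
)
```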

By combining fast RNGs with fast methods for creating integers in a range one gets good performance for sampling with replacement:

expression min mean median max itr/sec
sample.int(m, n, replace = TRUE) 94.9µs 112.5µs 112.2µs 3.78ms 8886.863
sample.int(10000 * m, n, replace = TRUE) 170.9µs 202.8µs 200.9µs 3.08ms 4930.094
dqsample.int(m, n, replace = TRUE) 31.5µs 36.2µs 35.7µs 183.63µs 27596.508
dqsample.int(10000 * m, n, replace = TRUE) 35.7µs 41.3µs 39.8µs 3.06ms 24186.338

Note that sampling from 10^10 integers triggers “long-vector support” in R.

When sampling without replacement, one has to choose an appropriate algorithm to ensure that no entry is repeated. When more than 50% of the population is sampled, dqrng shuffles an appropriate part of the full list and returns that. The algorithm used in R is similar, but dqrng has the edge with respect to performance:

expression min mean median max itr/sec
sample.int(m, n) 11.91ms 14.09ms 13.02ms 26.11ms 70.98544
dqsample.int(m, n) 6.93ms 8.22ms 8.17ms 9.99ms 121.65317

For lower sampling ratios dqrng uses a set-based rejection sampling algorithm. In principle, R can make use of a similar algorithm based on a hashset. However, R only uses it for larger input vectors, even though it is faster than the default method. The bitset-based algorithm in dqrng is faster still:

expression min mean median max itr/sec
sample.int(m, n) 788µs 1.72ms 1.87ms 5.01ms 583.0849
sample.int(m, n, useHash = TRUE) 229µs 269.6µs 266.15µs 453.71µs 3709.2196
dqsample.int(m, n) 113µs 130.4µs 129.49µs 2.98ms 7668.6069

As the sampling rate decreases further, dqrng switches to hashset-based rejection sampling. Both hashset-based methods have similar performance and are much faster than R’s default method:

expression min mean median max itr/sec
sample.int(m, n) 452.72µs 1.29ms 1.56ms 4.69ms 777.0096
sample.int(m, n, useHash = TRUE) 3.98µs 5.38µs 5.03µs 62.94µs 185978.0662
dqsample.int(m, n) 3.59µs 4.14µs 4.05µs 26.85µs 241454.9528

For larger sampling ranges R uses the hashset by default, though dqsample.int is still faster:

expression min mean median max itr/sec
sample.int(m, n) 5.3ms 5.79ms 5.48ms 9.5ms 172.7915
dqsample.int(m, n) 1.7ms 2.04ms 1.97ms 5.08ms 490.8562

Details

The following methods are used for sampling without replacement. The algorithms are presented in R-like pseudocode, even though the real implementation is in C++. For sampling rates above 50%, a partial Fisher-Yates shuffle is used:
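A sketch in R-like pseudocode of how such a partial shuffle can work (swap() and random_int() are pseudocode helpers, not real R functions):

```r
no_replacement_shuffle <- function(m, n) {
  tmp <- 1:m
  for (i in 1:n) {
    j <- i + random_int(m - i)  # j is uniform in [i, m]
    swap(tmp[i], tmp[j])        # move the selected value to position i
  }
  tmp[1:n]
}
```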

where random_int(m-i) returns a random integer in [0, m-i]. Since the full population is kept in memory, this method is only suitable for high selection rates. One could expect that reservoir sampling should work well for lower selection rates. However, in my tests set based algorithms were faster:
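A sketch of the set-based rejection sampling in the same R-like pseudocode (new_set() and elems.insert() are pseudocode helpers):

```r
no_replacement_set <- function(m, n) {
  result <- numeric(n)
  elems <- new_set(m, n)  # set sized for values in [0, m) and up to n entries
  for (i in 1:n) {
    repeat {
      v <- random_int(m)          # draw a candidate value
      if (elems.insert(v)) break  # keep v only if it has not been drawn before
    }
    result[i] <- v
  }
  result
}
```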

Here elems.insert(v) returns TRUE if the insert was successful, i.e. v was not in elems before, and FALSE otherwise. There are different strategies for implementing such a set. For intermediate sampling rates (currently between 0.1% and 50%) dqrng uses a bitset, i.e. a vector of m bits, each representing one of the possible values. For lower sampling rates the memory usage of this algorithm is too expensive, which is why a hashset1 is used, since its memory consumption scales with n and not with m. One could expect Robert Floyd’s sampling algorithm to be superior, but this was not the case in my tests, probably because it requires a final shuffling of the result to get a random permutation instead of a random combination.

  1. For the specialists: Open addressing with a power-of-two size between 1.5 and 3 times n, identity hash function for the stored integers and quadratic probing.
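To illustrate the bitset variant, here is a minimal C++ sketch of a bitset-backed set whose insert reports whether the value was new. This is an illustration of the idea, not dqrng’s actual implementation:

```cpp
#include <cstdint>
#include <vector>

// A set over the integers [0, m), stored as one bit per possible value.
// Memory scales with m, which is why this variant only suits intermediate
// sampling rates.
struct BitSet {
    std::vector<uint64_t> words;
    explicit BitSet(uint64_t m) : words((m + 63) / 64, 0) {}
    // Returns true if v was not present before (successful insert).
    bool insert(uint64_t v) {
        uint64_t mask = uint64_t{1} << (v % 64);
        uint64_t& word = words[v / 64];
        if (word & mask) return false;  // already sampled: reject
        word |= mask;
        return true;
    }
};
```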

A small food survey – Results

Late last year, we conducted a small (non-representative) food survey: Which food is most (un-)popular, are there different “food types”, or do food preferences correlate with traveling activities?

The results

Now, we present our findings. Most of the 60 participants were between 20 and 40 years old. A strikingly high fraction of more than 15 % indicated a vegan diet. The ratio of female to male participants was even. On average, the participants liked around 60 % of the prompted food items. (The links below are in German only, but the graphics speak for themselves!)

Wholemeal buns, oranges and chocolate were the most popular food items. Around 17 % had never eaten algae. For some of the prompted food items, positive preference changed to aversion. Noticeably, this change of mind was mainly indicated for vegetables, with olives leading the way. Overall, vegan participants were more open to the prompted food items than others.

Finally, there were certain connections between food preferences and personal attributes, such as favorite colors or visited continents. For example, ginger was particularly popular among people who have been to North America, whereas non-wholemeal buns were unpopular among people who have travelled in South America. For more results and info, please click one of the adjacent links.

See you soon, and all the best!

dqrng v0.1.0: breaking changes

A new version of dqrng has made it onto the CRAN servers. This version brings two breaking changes, hence the “larger than usual” change in version number:

  • An integer vector instead of a single int is used for seeding (Aaron Lun in #10)
    • Single integer seeds lead to a different RNG state than before.
    • dqrng::dqset_seed() expects an Rcpp::IntegerVector instead of an int
  • Support for the Mersenne-Twister has been removed; Xoroshiro128+ is now the default.

The first change is motivated by the desire to provide more than 32 bits of randomness as seed to the RNG. With this possibility in place, the previously used scrambling of the single 32-bit integer no longer made much sense and was therefore removed. The new method generateSeedVectors() for generating a list of random integer vectors from R’s RNG can be used to create such seed vectors.
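Usage might look like this (a sketch; the argument names and the number of 32-bit words per seed are my reading of the documentation, so verify against the package reference):

```r
library(dqrng)

# two seed vectors, each consisting of two 32-bit integers (64 bits of seed)
seeds <- generateSeedVectors(nseeds = 2, nwords = 2)
dqset.seed(seeds[[1]])
```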

The second change is related to a statement in R’s manual: Nor should the C++11 random number library be used …. I think this relates to the implementation-defined distributions rather than the generators, but in general one should follow WRE to the letter. So std::mt19937_64 had to go, and unfortunately it cannot be replaced by boost::random::mt19937_64 due to a not-yet-merged pull request. Instead of shipping a fixed version of MT, I opted for removal since:

  • MT is known to fail certain statistical tests.
  • MT is slower than the other generators.
  • MT was the only generator without support for multiple streams of random numbers, which are necessary for parallel operation.

The other use of C++11’s random library was the default initialization, which relied on std::random_device. This is no longer necessary, since the initial state of the default RNG is now based on R’s RNG, using the techniques developed for generateSeedVectors().

dqrng v0.0.5: New and updated RNGs

A new version of dqrng has made it onto the CRAN servers after a brief hiccup. Thanks to the CRAN team in general, and Uwe Ligges in particular, for their relentless efforts.

This version adds a new RNG to be used together with the provided distribution functions: the 64-bit version of the 20-round Threefry engine (Salmon et al., 2011, <doi:10.1145/2063384.2063405>, http://www.thesalmons.org/john/random123/) from the sitmo package. Thanks to James Balamuta for adding the C++11 version of Threefry to his package.

In addition, the PCG headers have been updated to the most recent version. Thanks to Aaron Lun for the heads-up and the pointer to two packages on BioConductor that are using dqrng: DropletUtils and scran.

There have also been two more technical changes: On the one hand the package is now prepared for the future, since -DSTRICT_R_HEADERS is added to PKG_CPPFLAGS in src/Makevars.
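The relevant line in src/Makevars looks roughly like this (a sketch; any additional flags in the actual file are omitted):

```make
PKG_CPPFLAGS = -DSTRICT_R_HEADERS
```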

On the other hand, I have added unit tests for the C++ interface. I first tried Catch, since I am using testthat already. However, I did not like compiling test code into the final library. Instead, I added C++ files that are compiled with Rcpp::sourceCpp() and run during the tests. This is also a more realistic test of the C++ interface.

A small food survey

This post is about a small side project: a (non-representative) food survey. Which foods are (un-)popular? Do food preferences correlate with traveling activities? Are there different “food types”? And more.

The idea

During lunch we naturally often talk about food: about different tastes, likes and dislikes. How (un-)popular are certain foods, are there different “types” of eaters, or do travelling destinations correlate with food preferences? Since we did not find accessible studies, and with the tools of web programming and statistics at hand, we decided to start a fun side project to shed some light on these questions.

Now we present to you a small food survey! Since we did not select a random sample, the survey does not allow representative statements: its findings only hold for the group of participants, not for Germany, and certainly not on a global scale. Nevertheless, we hope that you enjoy the survey as much as we do and that it reveals some interesting connections.

Click here to start the survey (in German only)!

Additional infos

After going through three medium-short pages of questions you will see some preliminary results, and how your answers compare to those of others. Moreover:

  • Of course, all data are fully anonymous and will not be used for any commercial purpose.
  • Throughout the course of the survey, we will update this blog post to keep you informed.
  • Please feel free to share this survey as you wish.

If there are any questions or comments, please send us a message. Enjoy!