Author Archives: Ralf Stubner

Fast sampling support in dqrng

About half a year ago I wrote about dqsample providing a bias free alternative to base::sample(). Meanwhile this is mostly of historic interest, since the upcomming R release 3.6.0 will see an updated sampling algorithm. However, some of the techniques used in dqsample are now part of dqrng. Of course, the sampling method is still free of the previously described bias:

This most recent version of dqrng is not on CRAN yet, but you can install it via drat:

Overall we will see that it is possible to improve on R’s sampling speed without compromising statistical qualities. Feedback is always welcome!

Benchmarks

The following benchmarks were made with R version 3.5.3 (2019-03-11) running under Debian GNU/Linux 9 (stretch) on a Intel® Core™ i7-6600U CPU @ 2.60GHz. YMMV, as always with benchmarks …

By combining fast RNGs with fast methods for creating integers in a range one gets good performance for sampling with replacement:

expression min mean median max itr/sec
sample.int(m, n, replace = TRUE) 94.9µs 112.5µs 112.2µs 3.78ms 8886.863
sample.int(10000 * m, n, replace = TRUE) 170.9µs 202.8µs 200.9µs 3.08ms 4930.094
dqsample.int(m, n, replace = TRUE) 31.5µs 36.2µs 35.7µs 183.63µs 27596.508
dqsample.int(10000 * m, n, replace = TRUE) 35.7µs 41.3µs 39.8µs 3.06ms 24186.338

Note that sampling from 10^10 integers triggers “long-vector support” in R.

When sampling without replacement one has to consider an appropriate algorithm for making sure that no entry is repeated. When more than 50% of the population are sampled, dqrng shuffles an appropriate part of the full list and returns that. The algorithm used in R is similar but dqrng has the edge with respect to performance:

expression min mean median max itr/sec
sample.int(m, n) 11.91ms 14.09ms 13.02ms 26.11ms 70.98544
dqsample.int(m, n) 6.93ms 8.22ms 8.17ms 9.99ms 121.65317

For lower sampling ratios a set based rejection sampling algorithm is used by dqrng. In principle, R can make use of a similar algorithm based on a hashset. However, it is only used for larger input vectors even though it is faster than the default method. The algorithm in dqrng, which is based on a bitset, is even faster, though:

expression min mean median max itr/sec
sample.int(m, n) 788µs 1.72ms 1.87ms 5.01ms 583.0849
sample.int(m, n, useHash = TRUE) 229µs 269.6µs 266.15µs 453.71µs 3709.2196
dqsample.int(m, n) 113µs 130.4µs 129.49µs 2.98ms 7668.6069

As one decreases the sampling rate even more, dqrng switches to a hashset based rejection sampling. Both hashset based methods have similar performance and are much faster than R’s default method.

expression min mean median max itr/sec
sample.int(m, n) 452.72µs 1.29ms 1.56ms 4.69ms 777.0096
sample.int(m, n, useHash = TRUE) 3.98µs 5.38µs 5.03µs 62.94µs 185978.0662
dqsample.int(m, n) 3.59µs 4.14µs 4.05µs 26.85µs 241454.9528

For larger sampling ranges R uses the hashset by default, though dqsample.int is still faster:

expression min mean median max itr/sec
sample.int(m, n) 5.3ms 5.79ms 5.48ms 9.5ms 172.7915
dqsample.int(m, n) 1.7ms 2.04ms 1.97ms 5.08ms 490.8562

Details

The following methods are used for sampling without replacement. The algorithms are presented in R-like pseudocode, even though the real implementation is in C++. For sampling rates above 50%, a partial Fisher-Yates shuffle is used:

where random_int(m-i) returns a random integer in [0, m-i]. Since the full population is kept in memory, this method is only suitable for high selection rates. One could expect that reservoir sampling should work well for lower selection rates. However, in my tests set based algorithms were faster:

Here elems.insert(v) returns TRUE if the insert was successful, i.e. v was not in elems before, and FALSE otherwise. There are different strategies for implementing such a set. For intermediate sampling rates (currently between 0.1% and 50%) dqrng uses a bitset, i.e. a vector of m bits each representing one of the possible values. For lower sampling rates the memory usage of this algorithm is to expensive, which is why a hashset1 is used, since there the used memory scales with n and not with m. One could expect that Robert Floyd’s sampling algorithm would be superior, but this was not the case in my tests, probably because it requires a final shuffling of the result to get a random permutation instead of a random combination.

  1. For the specialists: Open addressing with a power-of-two size between 1.5 and 3 times n, identity hash function for the stored integers and quadratic probing.

dqrng v0.1.0: breaking changes

A new version of dqrng has made it onto the CRAN servers. This version brings two breaking changes, hence the “larger than usual” change in version number:

  • An integer vector instead of a single int is used for seeding (Aaron Lun in #10)
    • Single integer seeds lead to a different RNG state than before.
    • dqrng::dqset_seed() expects a Rcpp::IntegerVector instead of an int
  • Support for Mersenne-Twister has been removed, Xoroshiro128+ is now the default.

The first change is motivated by the desire to provide more than 32 bits of randomness as seed to the RNG. With this possibility in place, the previously used scrambling of the single 32 bit integer did not make much sense anymore and was therefore removed. The new method generateSeedVectors() for generating a list of random int vectors from R’s RNG can be used to generate such seed vector.

The second change is related to a statement in R’s manual: Nor should the C++11 random number library be used …. I think that relates to the implementation-defined distributions and not the generators, but in general one should follow WRE by the letter. So std::mt19937_64 has to go, and unfortunately it cannot be replaced by boost::random::mt19937_64 due to a not-merged pull request. Instead of shipping a fixed version of MT I opted for removal since:

  • MT is known to fail certain statistical tests.
  • MT is slower than the other generators.
  • MT was the only generator that does not support multiple streams of random numbers necessary for parallel operations.

The other usage of random from C++11 was the default initialization, which used std::random_device. This is now unnecessary since the initial state of the default RNG is now based on R’s RNG, using the techniques developed for generateSeedVectors().

dqrng v0.0.5: New and updated RNGs

A new version of dqrng has made it onto the CRAN servers after a brief hick-up. Thanks to the CRAN team in general and Uwe Ligges in particular for their relentless efforts.

This versions adds a new RNG to be used together with the provided distribution functions: The 64 bit version of the 20 rounds Threefry engine (Salmon et al., 2011 <doi:10.1145/2063384.2063405>, http://www.thesalmons.org/john/random123/) from the sitmo package. Thanks to James Balamuta for adding the C++11 version of Threefry to his package.

In addition, the PCG headers have been updated to the most recent version. Thanks to Aaron Lun for the heads-up and the pointer to two packages on BioConductor that are using dqrng: DropletUtils and scran.

There have also been two more technical changes: On the one hand the package is now prepared for the future, since -DSTRICT_R_HEADERS is added to PKG_CPPFLAGS in src/Makevars.

On the other hand I have added unit tests for the C++ interface. I first tried Catch since I am using testthat already. However, I did not like compiling test code into the final library. Instead I have added C++ files that are compiled using Rcpp::sourceCpp() and run during the test. This is also a more realistic test of the C++ interface.

dqsample: A bias-free alternative to base::sample()

For many tasks in statistics and data science it is useful to create a random sample or permutation of a data set. Within R the function base::sample() is used for this task. Unfortunately this function uses a slightly biased algorithm for creating random integers within a given range. Most recently this issue has been discussed in a thread on R-devel, which is also the motivation of the dqsample package. Currently dqsample is not on CRAN, but it can be installed via drat:

Example for the bias

When sampling many random integers the density of odd and even numbers should be roughly equal and constant. However, this is not the case with base::sample:

plot of chunk base

Or with slightly different parameters:

plot of chunk base-oszi

This particular example for the bias was found by Duncan Murdoch.

In dqsample the algorithm suggested by Daniel Lemire (2018, <arXiv:1805.1094>) is used. With this algorithm there is no observable bias between odd and even numbers:

plot of chunk dqsample

Where does the bias come from?

Internally the base::sample() function needs uniformly distributed random integers in an half-open range [0, n). In order to do so, R uses random floating point numbers that are uniformly distributed in [0, 1), multiplies by n and truncates the result to the next smaller integer. This method would be fine, if the random numbers used as starting point would be real numbers in the mathematical sense. However, this is not the case here.

The default random-number generator in R is a 32 bit version of the Mersenne-Twister. It produces random integers uniformly distributed in [0, 2^32), which are then divided by 2^32 to produce doubles in [0, 1). We can now invert the procedure described above to see how many integers are mapped to a certain result. For example, we could simulate rolling ten dice using sample(6, 10, replace = TRUE). Since 2^32 is not a multiple of six, the distribution cannot be completely even:

We see that both one and four are very slightly less likely than the other numbers. This effect gets much more pronounced as the number of items increases from which one can choose. For example, we can use the m from above to see how that uneven distribution of odd and even numbers came about:

Here we see that while only two integers map to any odd number, there are three integers mapped to the even numbers. This pattern shifts half way through the possible results, making the odd numbers more likely, leading to the first image displayed above. As one goes away from m, these pattern shifts occur more rapidly, leading to the oscillatory behaviour seen in the second image. As one moves further away from m, these oscillations happen so rapidly, that a density plot of odd and even numbers looks constant, but the bias is still there. For example, for m - 2^20 one such pattern shift happens between 982 and 983:

Below this point, even numbers are more likely than odd numbers. After this point, the pattern is reversed.

Conclusion

The algorithm used by base::sample() is biased with non-negligible effects when sampling from large data sets. The dqsample package provides an unbiased algorithm for the most common cases. It can be used as a drop-in replacement for the functionality provided by base::sample().

tikzDevice has a new home

Back in February the tikzDevice package became ORPHANED on CRAN. Consequently Kirill Müller and Yihui Xie searched for a new maintainer. When I read about it some time later, we decided that it makes sense for us to step in here. After a brief mail exchange with Yihui Xie the GitHub repository was transfered to our organization and can now be found at https://github.com/daqana/tikzDevice. Meanwhile I have implemented fixes for the existing warning messages from the CRAN checks and uploaded version 0.12, which is currently making its’ way onto CRAN. The next steps will be to work through the existing issues.

What can one do with the tikzDevice? It is a graphics device similar to pdf() or png(). But instead of an image file that might be included in a report as external graphic, it generates files in the TikZ format that makes LaTeX generate the graphic. This enables consistent fonts between text and graphics and TeX’s capabilities for typesetting mathematical equations within graphics. The pdf vignette contains many examples.

One can even use it in a R-markdown document. A document using

in the YAML header and

in a setup chunk will use the same fonts for text and graphics when those are created with dev = "tikz". In this example Palatino with text-figures:

Example Palatino with text-figures