Author Archives: Ralf Stubner

dqrng v0.0.5: New and updated RNGs

A new version of dqrng has made it onto the CRAN servers after a brief hick-up. Thanks to the CRAN team in general and Uwe Ligges in particular for their relentless efforts.

This versions adds a new RNG to be used together with the provided distribution functions: The 64 bit version of the 20 rounds Threefry engine (Salmon et al., 2011 <doi:10.1145/2063384.2063405>, http://www.thesalmons.org/john/random123/) from the sitmo package. Thanks to James Balamuta for adding the C++11 version of Threefry to his package.

In addition, the PCG headers have been updated to the most recent version. Thanks to Aaron Lun for the heads-up and the pointer to two packages on BioConductor that are using dqrng: DropletUtils and scran.

There have also been two more technical changes: On the one hand the package is now prepared for the future, since -DSTRICT_R_HEADERS is added to PKG_CPPFLAGS in src/Makevars.

On the other hand I have added unit tests for the C++ interface. I first tried Catch since I am using testthat already. However, I did not like compiling test code into the final library. Instead I have added C++ files that are compiled using Rcpp::sourceCpp() and run during the test. This is also a more realistic test of the C++ interface.

dqsample: A bias-free alternative to base::sample()

For many tasks in statistics and data science it is useful to create a random sample or permutation of a data set. Within R the function base::sample() is used for this task. Unfortunately this function uses a slightly biased algorithm for creating random integers within a given range. Most recently this issue has been discussed in a thread on R-devel, which is also the motivation of the dqsample package. Currently dqsample is not on CRAN, but it can be installed via drat:

Example for the bias

When sampling many random integers the density of odd and even numbers should be roughly equal and constant. However, this is not the case with base::sample:

plot of chunk base

Or with slightly different parameters:

plot of chunk base-oszi

This particular example for the bias was found by Duncan Murdoch.

In dqsample the algorithm suggested by Daniel Lemire (2018, <arXiv:1805.1094>) is used. With this algorithm there is no observable bias between odd and even numbers:

plot of chunk dqsample

Where does the bias come from?

Internally the base::sample() function needs uniformly distributed random integers in an half-open range [0, n). In order to do so, R uses random floating point numbers that are uniformly distributed in [0, 1), multiplies by n and truncates the result to the next smaller integer. This method would be fine, if the random numbers used as starting point would be real numbers in the mathematical sense. However, this is not the case here.

The default random-number generator in R is a 32 bit version of the Mersenne-Twister. It produces random integers uniformly distributed in [0, 2^32), which are then divided by 2^32 to produce doubles in [0, 1). We can now invert the procedure described above to see how many integers are mapped to a certain result. For example, we could simulate rolling ten dice using sample(6, 10, replace = TRUE). Since 2^32 is not a multiple of six, the distribution cannot be completely even:

We see that both one and four are very slightly less likely than the other numbers. This effect gets much more pronounced as the number of items increases from which one can choose. For example, we can use the m from above to see how that uneven distribution of odd and even numbers came about:

Here we see that while only two integers map to any odd number, there are three integers mapped to the even numbers. This pattern shifts half way through the possible results, making the odd numbers more likely, leading to the first image displayed above. As one goes away from m, these pattern shifts occur more rapidly, leading to the oscillatory behaviour seen in the second image. As one moves further away from m, these oscillations happen so rapidly, that a density plot of odd and even numbers looks constant, but the bias is still there. For example, for m - 2^20 one such pattern shift happens between 982 and 983:

Below this point, even numbers are more likely than odd numbers. After this point, the pattern is reversed.

Conclusion

The algorithm used by base::sample() is biased with non-negligible effects when sampling from large data sets. The dqsample package provides an unbiased algorithm for the most common cases. It can be used as a drop-in replacement for the functionality provided by base::sample().

tikzDevice has a new home

Back in February the tikzDevice package became ORPHANED on CRAN. Consequently Kirill Müller and Yihui Xie searched for a new maintainer. When I read about it some time later, we decided that it makes sense for us to step in here. After a brief mail exchange with Yihui Xie the GitHub repository was transfered to our organization and can now be found at https://github.com/daqana/tikzDevice. Meanwhile I have implemented fixes for the existing warning messages from the CRAN checks and uploaded version 0.12, which is currently making its’ way onto CRAN. The next steps will be to work through the existing issues.

What can one do with the tikzDevice? It is a graphics device similar to pdf() or png(). But instead of an image file that might be included in a report as external graphic, it generates files in the TikZ format that makes LaTeX generate the graphic. This enables consistent fonts between text and graphics and TeX’s capabilities for typesetting mathematical equations within graphics. The pdf vignette contains many examples.

One can even use it in a R-markdown document. A document using

in the YAML header and

in a setup chunk will use the same fonts for text and graphics when those are created with dev = "tikz". In this example Palatino with text-figures:

Example Palatino with text-figures

First CRAN release for dqrng

The dqrng package is now available from CRAN. It is possible to install it using

Besides this simplified installation the included RNGs have been updated: Xorshift128+ and Xorshift1024* have been removed in favor of the new Xorshiro256+, c.f. http://xoshiro.di.unimi.it/. Using the provided RNGs from R is unchanged:

Fast Random Numbers for R with dqrng

If you need only a few truly random numbers you might use dice or atmospheric noise. However, if you need many random numbers you will have to use a pseudo random number generator (RNG). R includes support for different RNGs (c.f. ?Random) and a wide variety of distributions (c.f. ?distributions). The underlying methods have been well tested, but faster methods are available. The dqrng package provides fast random number generators with good statistical properties for usage with R. It combines these RNGs with fast distribution functions to sample from uniform, normal or exponential distributions.

Installation

At the moment dqrng is not on CRAN, but you can install the current version via drat:

Usage

Using the provided RNGs from R is deliberately similar to using R’s build-in RNGs:

They are quite a bit faster, though, as we can see by comparing 10 million random draws from different distributions:

expr min lq mean median uq max neval cld
runif 248.16730 251.83371 262.20559 260.33073 265.69415 322.15771 100 d
dqrunif 34.77413 35.44569 39.40738 36.82459 38.42524 109.96758 100 a
rnorm 587.40975 596.92850 618.79356 613.08345 624.31043 706.79528 100 f
dqrnorm 63.17649 64.43796 68.77696 66.80184 68.39577 141.97466 100 c
rexp 392.79228 397.48715 413.66996 411.14180 420.42473 494.49631 100 e
dqrexp 52.75875 53.64510 57.15006 55.80021 58.65553 79.11577 100 b

plot of chunk unnamed-chunk-4

For r* the default Mersenne-Twister was used, while dqr* used Xoroshiro128+ in this comparison. For rnorm the default inversion method was used, while dqrnorm (and dqrexp) used the Ziggurat algorithm from Boost.Random with additional tuning.

Both the RNGs and the distribution functions are distributed as C++ header-only library. See the included vignette for possible usage from C++.

Supported Random Number Generators

Support for the following 64 bit RNGs is currently included:

  • Mersenne-Twister
    The 64 bit variant of the well-known Mersenne-Twister, which is also used as default. This is a conservative default that allows you to take advantage of the fast distribution functions provided by dqrng while staying close to R’s default RNG (32 bit Mersenne-Twister).
  • pcg64
    The default 64 bit variant from the PCG family developed by Melissa O’Neill. See http://www.pcg-random.org for more details.
  • Xoroshiro128+, Xorshift128+, and Xorshift1024*
    RNGs mainly developed by Sebastiano Vigna. They are used as default RNGs in Erlang and different JavaScript engines. See http://xoroshiro.di.unimi.it/ for more details.