
dqsample: A bias-free alternative to base::sample()

For many tasks in statistics and data science it is useful to create a random sample or permutation of a data set. Within R, the function base::sample() is used for this task. Unfortunately, this function uses a slightly biased algorithm for creating random integers within a given range. Most recently this issue was discussed in a thread on R-devel, which is also the motivation for the dqsample package. Currently dqsample is not on CRAN, but it can be installed via drat:
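A minimal sketch of such an installation, assuming the drat repository is published under the GitHub account daqana (the organization named elsewhere on this page; the repository location itself is an assumption):

```r
# Assumption: the drat repository lives under the GitHub account "daqana".
if (!requireNamespace("drat", quietly = TRUE)) install.packages("drat")
drat::addRepo("daqana")
install.packages("dqsample")
```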

Example for the bias

When sampling many random integers the density of odd and even numbers should be roughly equal and constant. However, this is not the case with base::sample:

plot of chunk base

Or with slightly different parameters:

plot of chunk base-oszi

This particular example for the bias was found by Duncan Murdoch.
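The effect can also be checked numerically instead of visually. The following is only a sketch: the value m = 2/5 * 2^32 is an assumption (the original definition of m is not shown here), picked to be consistent with the two-versus-three mapping described further down. From R 3.6.0 onwards the default sampler was changed to rejection sampling, so the older behaviour has to be selected explicitly via RNGkind(sample.kind = "Rounding").

```r
# Assumption: m = 2/5 * 2^32, so that 2^32 / m is 2.5.
m <- 2 / 5 * 2^32
n <- floor(m)

# R >= 3.6.0 defaults to rejection sampling; select the older sampler
# (R warns that it is non-uniform, hence suppressWarnings).
old <- RNGkind()
if (getRversion() >= "3.6.0") {
  suppressWarnings(RNGkind(sample.kind = "Rounding"))
}

set.seed(42)
x <- sample(n, 1e6, replace = TRUE)

# Share of even numbers among the draws from the lowest and highest quarter.
share_even_low  <- mean(x[x <= n / 4] %% 2 == 0)
share_even_high <- mean(x[x >= 3 * n / 4] %% 2 == 0)
c(low = share_even_low, high = share_even_high)

# Restore the sampler that was active before.
if (getRversion() >= "3.6.0") RNGkind(sample.kind = old[3])
```

With an unbiased sampler both shares would be close to 0.5. If the three-to-two mapping holds, the share of even numbers should instead come out near 0.6 in the lowest quarter and near 0.4 in the highest quarter.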

In dqsample the algorithm suggested by Daniel Lemire (2018, <arXiv:1805.10941>) is used. With this algorithm there is no observable bias between odd and even numbers:
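To give an idea of how the multiply-and-reject step in Lemire's method removes the bias, here is a scaled-down sketch. It is not the dqsample implementation: it uses 16-bit instead of 32-bit words so that all products are exact in R's doubles, and it enumerates every possible raw integer instead of drawing random ones. The names W, s and lemire_accept are made up for this illustration.

```r
# Scaled-down sketch of Lemire's method with 16-bit words instead of
# 32-bit ones, so that every product fits exactly into a double.
# W, s and lemire_accept are illustrative names, not dqsample's API.
W <- 2^16   # number of possible raw random integers
s <- 6      # size of the target range [0, s)

lemire_accept <- function(x, s, W) {
  m <- x * s            # full-width product
  low <- m %% W         # lower word of the product
  threshold <- W %% s   # draws whose lower word is below this are rejected
  list(accept = low >= threshold, value = m %/% W)
}

# Enumerate every possible raw integer instead of drawing random ones.
res <- lemire_accept(0:(W - 1), s, W)
counts <- table(res$value[res$accept])
counts
```

Every outcome ends up with the same number of accepted raw integers, namely W %/% s. In the actual algorithm a rejected draw is simply replaced by a fresh random integer.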

plot of chunk dqsample

Where does the bias come from?

Internally the base::sample() function needs uniformly distributed random integers in a half-open range [0, n). To get them, R uses random floating point numbers that are uniformly distributed in [0, 1), multiplies them by n, and truncates the result to the next smaller integer. This method would be fine if the random numbers used as the starting point were real numbers in the mathematical sense. However, this is not the case here.

The default random-number generator in R is a 32-bit version of the Mersenne Twister. It produces random integers uniformly distributed in [0, 2^32), which are then divided by 2^32 to produce doubles in [0, 1). We can now invert the procedure described above to see how many integers are mapped to a certain result. For example, we could simulate rolling ten dice using sample(6, 10, replace = TRUE). Since 2^32 is not a multiple of six, the distribution cannot be completely even:
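The size of the imbalance follows from an integer division. A minimal sketch (the variable names are illustrative):

```r
raw_values <- 2^32     # number of distinct raw integers
faces <- 6

raw_values %/% faces   # 715827882 raw integers for every face ...
raw_values %% faces    # ... plus 4 left over that cannot be shared evenly
```

Four faces therefore receive one raw integer more than the remaining two.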

We see that both one and four are very slightly less likely than the other numbers. This effect gets much more pronounced as the number of items to choose from increases. For example, we can use the m from above to see how that uneven distribution of odd and even numbers came about:
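A sketch of such a count for the first few outcomes. It assumes m = 2/5 * 2^32 (the original definition of m is not shown here) and mirrors the multiply-and-truncate step by mapping each raw integer k to floor(k / 2^32 * n) + 1:

```r
# Assumption: m = 2/5 * 2^32 as in the odd/even example.
n <- floor(2 / 5 * 2^32)

# Map raw integers k to outcomes; sample() results start at 1.
k <- 0:52
outcome <- floor(k * n / 2^32) + 1
counts <- table(outcome)

# Outcomes 2 to 21; outcome 1 is skipped because it additionally
# receives the raw integer k = 0.
counts[as.character(2:21)]
```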

Here we see that while only two integers map to any odd number, three integers map to each even number. This pattern shifts halfway through the possible results, making the odd numbers more likely, which leads to the first image displayed above. As one moves away from m, these pattern shifts occur more rapidly, leading to the oscillatory behaviour seen in the second image. As one moves further away from m, these oscillations happen so rapidly that a density plot of odd and even numbers looks constant, but the bias is still there. For example, for m - 2^20 one such pattern shift happens between 982 and 983:
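A sketch of how this shift can be made visible, again assuming m = 2/5 * 2^32 and the mapping floor(k / 2^32 * n) + 1 from raw integers k to outcomes:

```r
# Assumption: m = 2/5 * 2^32 as in the odd/even example.
n <- floor(2 / 5 * 2^32 - 2^20)

# Enumerate a window of raw integers k around the shift.
k <- 2440:2475
outcome <- floor(k * n / 2^32) + 1
counts <- table(outcome)

# Number of raw integers mapped to each of the outcomes 979 to 986.
counts[as.character(979:986)]
```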

Below this point, even numbers are more likely than odd numbers. After this point, the pattern is reversed.

Conclusion

The algorithm used by base::sample() is biased with non-negligible effects when sampling from large data sets. The dqsample package provides an unbiased algorithm for the most common cases. It can be used as a drop-in replacement for the functionality provided by base::sample().

daqana’s R style guide is online

It is an enhanced version of the tidyverse style guide (http://style.tidyverse.org/).

What is a style guide and why do we need it?

A style guide provides programmers with rules specifying how their code should be written. To be functional, code has to conform to a certain grammar and punctuation. As in natural language, there is some degree of freedom.

A style guide formulates recommendations that restrict these freedoms with the objective of standardization. The recommendation of a certain variant can be rather arbitrary. In other cases, though, one variant may be better suited than others with respect to some quality criteria. Those quality criteria can pertain to readability and understandability, as well as to the capability (performance, scalability) of the code.

The advantages of programmers following a common style guide are thus, first, adherence to quality standards; second, the ability to get to grips with code quickly, even if it was written by others or jointly; and, not least and as a consequence, easier maintenance and extension of the code.

Why is the tidyverse style guide used as a template and why has it been reworked?

The tidyverse comprises a lot of packages, many of them making it easy to increase the quality of one’s analyses. Those packages became extremely popular in no time. The basis for the tidyverse style guide can be found both in the “Google R style guide” (https://google.github.io/styleguide/Rguide.xml) and in the excellent package development book by Hadley Wickham (“R packages”, http://r-pkgs.had.co.nz/).

The tidyverse style guide incorporates not only a conventional chapter on syntax but also details and how-tos for package development. It is this broad concept, paired with the expertise behind the tidyverse style guide and its wide adoption, that made us consider it as a guideline for our daily tasks.

In fact, the daqana style guide coincides with the tidyverse style guide for the most part. In particular places explanations have been added to motivate the decisions taken, especially where changes have been made. A note has been prepended to the chapter on the pipe, as its use is viewed rather skeptically. A short chapter on unit tests has been added.

The most visible difference is the introduction of color coding for good and bad examples, which greatly improves user-friendliness. Further, we sought to add hints to fitting lintr functions (https://github.com/jimhester/lintr) directly in the text where rules are established. Our linters are made available in a file daqana_linters.R, facilitating their use with the respective RStudio addin.

The daqana style guide is not considered complete. We will examine how it proves its value in our daily routine and are planning extensions for upcoming versions, such as a compilation of suitable styler functions (https://github.com/r-lib/styler) in analogy to the linters.

So, here it is: the first version of the daqana R style guide. Happy coding! https://www.daqana.org/dqstyle-r/

tikzDevice has a new home

Back in February the tikzDevice package became ORPHANED on CRAN. Consequently Kirill Müller and Yihui Xie searched for a new maintainer. When I read about it some time later, we decided that it makes sense for us to step in here. After a brief mail exchange with Yihui Xie the GitHub repository was transferred to our organization and can now be found at https://github.com/daqana/tikzDevice. Meanwhile I have implemented fixes for the existing warning messages from the CRAN checks and uploaded version 0.12, which is currently making its way onto CRAN. The next steps will be to work through the existing issues.

What can one do with the tikzDevice? It is a graphics device similar to pdf() or png(). But instead of an image file that might be included in a report as an external graphic, it generates files in the TikZ format, from which LaTeX generates the graphic. This enables consistent fonts between text and graphics and makes TeX’s capabilities for typesetting mathematical equations available within graphics. The PDF vignette contains many examples.

One can even use it in an R Markdown document. A document using

in the YAML header and

in a setup chunk will use the same fonts for text and graphics when those are created with dev = "tikz". In this example Palatino with text-figures:

Example Palatino with text-figures

(Unit) Testing Shiny apps using testthat

This blog post explains how to test a Shiny app using the shinytest and testthat packages. Basic knowledge about Shiny apps and the principle of unit testing using testthat is useful, but not required here.

Example of a Shiny app

The packages shiny (current version: 1.1.0), testthat (2.0.0) and shinytest (1.3.0) are required for the test presented here and may be installed with install.packages().

Below is a minimal example of a Shiny app (app.R) to be tested. The app has only a single numerical input and a text output. The entered number n is squared and the result is shown as text.
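A sketch of what such an app.R could look like. The ids num_input and text_out and the output string are taken from the test described further down; the input label is made up.

```r
library(shiny)

# The ids num_input and text_out follow the names used in the test.
ui <- fluidPage(
  numericInput("num_input", "Please insert a number n:", value = 0),
  textOutput("text_out")
)

server <- function(input, output, session) {
  # Square the entered number and show the result as text.
  output$text_out <- renderText({
    paste("The square of the number n is: n² =", input$num_input^2)
  })
}

shinyApp(ui = ui, server = server)
```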

And this is what the app looks like:

What is shinytest?

The package shinytest provides automatic testing of a Shiny app. Both the “appearance” of the app and its “internal” state during the program flow can be examined. An interactive user interface can be used to create snapshots (more precisely, reference snapshots) as well as a test file. The test file contains the code required for later generation of the snapshots. Each test run creates new snapshots and compares them to the reference snapshots to automatically detect unexpected behavior of the Shiny app. More about the normal workflow with shinytest can be found here. This blog post, however, describes a different approach to testing, namely testing with shinytest and testthat combined [*].

[*] Another way of testing uses the function expect_pass (see ?expect_pass), which needs a test file created by shinytest (as well as reference snapshots) as an input argument. While this allows quick testing, the approach presented here enables more detailed and specific tests.

Testing Shiny apps using shinytest & testthat

shinytest has the class ShinyDriver (see ?ShinyDriver), which opens the Shiny app in a new R session as well as an instance of PhantomJS, and connects the two. PhantomJS is a headless web browser that can be controlled via JavaScript. The ShinyDriver object is equipped with various methods that enable, among other things, setting and getting the values of different variables (inputs or outputs) in the Shiny app. That way, we can assign arbitrary values to the input variables “manually” (without the usual user interface of the Shiny app) and then get the output variables.

Example of a test

In the following test, the variable num_input is set to 30 and consequently the variable text_out is tested to see if it becomes the string “The square of the number n is: n² = 900”. More on testing with testthat can be found here.
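A sketch of such a test, assuming the app from above is stored in a directory referred to here by the placeholder path/to/app:

```r
library(shinytest)
library(testthat)

test_that("text_out shows the square of num_input", {
  # Assumption: app.R lives in the directory "path/to/app" (placeholder).
  app <- ShinyDriver$new("path/to/app")

  # Set the input without going through the user interface ...
  app$setInputs(num_input = 30)

  # ... and read the resulting output value.
  output <- app$getValue(name = "text_out")
  expect_equal(output, "The square of the number n is: n² = 900")

  app$stop()
})
```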

Using the expectation functions of the package testthat it is thus easily possible to test the functionalities of the Shiny app. An advantage of this is that when calling devtools::test() both the tests of the Shiny app and other unit tests are taken into account.

Deeper insights – Exported variables and HTML widgets

Within the server function, we can also define new variables (in addition to the usual inputs and outputs) and export them to be “visible” for shinytest and allow for more detailed examination of the app’s workflows. As an example for the Shiny app shown above, we can save a list of all the numbers n entered and export them as a variable inputs_list (please see the code below).

For different HTML widgets the method findElement can be used via app$findElement(xpath ="here the XPATH") using the XPath parameter. For example, if notifications are used with the showNotification() function in the Shiny app, they can be identified with xpath = "//*[@id=\"shiny-notification-panel\"]" and can be tested correspondingly.

Here is how to export a variable in the Shiny app and display notifications using showNotification().
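A sketch of a server function along these lines. The name inputs_list is taken from the text above; the helper name entered and the notification text are made up for this illustration.

```r
library(shiny)

server <- function(input, output, session) {
  # Collect every number entered so far.
  entered <- reactiveValues(inputs_list = c())

  observeEvent(input$num_input, {
    entered$inputs_list <- c(entered$inputs_list, input$num_input)
    # The message text is made up for this sketch.
    showNotification(paste("Received n =", input$num_input))
  })

  # Exported values are only available when the app runs in test mode.
  exportTestValues(inputs_list = {
    entered$inputs_list
  })

  output$text_out <- renderText({
    paste("The square of the number n is: n² =", input$num_input^2)
  })
}
```

Note that observeEvent() also fires for the initial value of the input, so in this sketch the exported list starts with that value.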

For example, a test might look like this:
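A sketch, assuming an app that exports inputs_list and shows a notification for every entered number, stored under the placeholder path/to/app. The expectations are deliberately loose, since the exact notification text depends on the app.

```r
library(shinytest)
library(testthat)

test_that("entered numbers are exported and a notification appears", {
  # Assumption: the app lives in the directory "path/to/app" (placeholder).
  app <- ShinyDriver$new("path/to/app")
  app$setInputs(num_input = 30)

  # Exported values are part of the list returned by getAllValues().
  exported <- app$getAllValues()$export
  expect_true(30 %in% unlist(exported$inputs_list))

  # Locate the notification panel via its XPath and check that it has text.
  panel <- app$findElement(xpath = "//*[@id=\"shiny-notification-panel\"]")
  expect_true(nchar(panel$getText()) > 0)

  app$stop()
})
```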

First CRAN release for dqrng

The dqrng package is now available from CRAN. It is possible to install it using
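With the package on CRAN, the standard call should suffice:

```r
install.packages("dqrng")
```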

Besides this simplified installation, the included RNGs have been updated: Xorshift128+ and Xorshift1024* have been removed in favor of the new Xoshiro256+, cf. http://xoshiro.di.unimi.it/. Using the provided RNGs from R is unchanged:
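A short sketch of typical usage, assuming the dq-prefixed functions exported by dqrng; the seed and the distribution parameters are arbitrary:

```r
library(dqrng)

dqRNGkind("Xoshiro256+")   # select one of the provided generators
dqset.seed(42)             # seed dqrng's own RNG state

u <- dqrunif(5, min = 2, max = 10)   # uniform random numbers
e <- dqrexp(5, rate = 4)             # exponential random numbers
z <- dqrnorm(5, mean = 5, sd = 3)    # normal random numbers
u
```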