Chapter 3 Working with repro

The repro package is designed to streamline the researcher’s workflow. It helps researchers set up, create, reproduce, and change an analysis with little more than a simple mental model. To that end, it stands on the shoulders of giants and provides only a minimal layer of abstraction over the tools mentioned earlier. While there is considerable variety in how researchers can approach an empirical study and its analysis, Figure 3.1 offers an idealized workflow.

Figure 3.1: Schematic illustration of a reproducible workflow.

This workflow can be approached from two perspectives: that of the author of a reproducible analysis and that of a contributor. I use the term contributor in its broadest possible meaning, because, in my view, even a reader who is thorough enough to reproduce an analysis contributes something of value. Consequently, I include everyone with an interest in reproducing the analysis in the group of contributors, be it a coauthor, a reviewer, or the very person who created the analysis at an earlier point in time.

From the contributor’s point of view, repro acts as an assistance system, advising users on how to set up their computers and how to reproduce an analysis. It is a major difficulty for untrained users to assess their system’s state accurately and act accordingly (Parasuraman & Mouloua, 2018, Chapter 8: “Automation and Situation Awareness”). repro parses complex technical information and presents it straightforwardly. It therefore enables even relatively inexperienced users to make use of tools that would otherwise require extensive training. From the perspective of the author who creates an analysis, repro is a toolbox that provides the right tools for most of the user’s requirements. In the creation phase, the benefit for inexperienced users is even more pronounced. Following the standards of Peikert & Brandmaier (2019) usually requires experience in the use of Bash, Make, Docker, YAML, (R-)Markdown, R, LaTeX, knitr, and Git. Besides some practice, users need to know, remember, and correctly implement best practices in each tool. repro significantly lowers this threshold. It still expects the user to have some training in the use of R, (R-)Markdown, and Git (which are increasingly recognized as standards in the Open Science and R communities) but forgoes training in Make and Docker. An essential guideline in designing repro was to make best practices easier to implement than their alternatives. Where possible, repro allows the user to forget about the details of the often tricky implementation and abstracts them away.

The structure of this chapter mirrors the workflow in Figure 3.1, first from the perspective of a contributor, then from that of a creator. It explains step by step how to:

  1. set up the required software;
  2. reproduce an analysis that follows the standards outlined here;
  3. apply and publish changes; and
  4. create a reproducible workflow from scratch.

repro supports several alternative software implementations for each step and integrates them into one coherent workflow. This modular structure is inspired by the usethis package (Wickham & Bryan, 2020). The steps described here follow the recommendations of Peikert & Brandmaier (2019) and hence combine RMarkdown, Git & GitHub, Make, and Docker.

3.1 Setup

Reproduction, change, and creation of an analysis require the user to have software installed that is specific to the workflow they choose but independent of the analysis. A set of functions (following the pattern check_*; for a complete list, see help(check)) assists users in ensuring that everything they need is installed and correctly configured. If repro detects that something is not installed or configured, it guides users through a step-by-step procedure to resolve these issues in accordance with their specific software platform. Currently, it supports all major operating systems (Windows, macOS, Linux).

First, users have to install repro. The following code snippet installs repro from GitHub:
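A sketch of the installation, mirroring the command recommended in the package README:

```r
# Install the remotes package if it is missing, then install repro from GitHub
if (!requireNamespace("remotes", quietly = TRUE)) install.packages("remotes")
remotes::install_github("aaronpeikert/repro")
```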

Once repro is installed, one may load it via:
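```r
library("repro")
```

Loading the package prints the message shown below: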

> ℹ repro is BETA software! Please report any bugs:
>   'https://github.com/aaronpeikert/repro/issues'

Subsequently, users can check whether the required software is already installed. The workflow by Peikert & Brandmaier (2019) depends on Git (and GitHub), Make, and Docker. Consequently, the following commands check whether the user has set up all the requirements:
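Following the check_* pattern introduced above, the calls corresponding to the output below are:

```r
check_git()
check_make()
check_docker()
```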

> ✔ Git is installed, don't worry.
> ✔ Make is installed, don't worry.
> ✔ You are inside a Docker container!

If everything is set up, users can proceed to reproduce an analysis that conforms to this workflow.

If not, e.g., because Docker is not installed, users get an informative message appropriate for their platform (the following code chunk shows the message Windows users get when Git is missing).

> ✖ Git is not installed.
> ℹ We recommend Chocolatey for Windows users.
> ✖ Chocolatey is not installed.
> ● To install it, follow directions on:
>   'https://chocolatey.org/docs/installation'
> ℹ Use an administrator terminal to install Chocolatey.
> ● Restart your computer.
> ● Run `choco install -y git` in an admin terminal to install Git.

3.2 Reproduction

This thesis was written according to the proposed standards using repro and may serve as an example of reproduction. I also provide a minimal example that contains a data analysis in the Creation section below.

GitHub, Make, and Docker are sufficient to reproduce this very document. So if you followed the steps above, everything is set up to download the source files of this document, rerun the code within it, and verify its results.

The following command uses Git and GitHub to:

  1. create a copy of the project, called a “fork”, in your GitHub account;
  2. download this copy to your computer, and
  3. verify that all files are intact and open them in a new RStudio instance.
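Such a command could look as follows, using the usethis package; the repository path is illustrative and stands in for the actual repository of this thesis:

```r
# Fork on GitHub, clone the fork locally, and open it in a new RStudio instance
usethis::create_from_github("aaronpeikert/repro-thesis", fork = TRUE)
```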

If executed, this code opens a new R session, and therefore, all code from here on out needs to run in the new session.

It is tempting to automate the reproduction part entirely and provide a rerun() function that figures out what to do and does it for you. However, I decided that reproduction must be feasible without the repro package, to avoid the monopolization of reproducibility by a single software package. This decision is supposed to ensure that long-term reproducibility does not depend on the availability of the package. I am confident that Git, Make, and Docker will be available for years to come, whereas I cannot say the same about this package. To balance the needs of long-term support and usability, repro offers advice about what to do but stops right before doing it (you can see an example after the next code chunk).

Which steps one has to take depends on the tool chosen to implement dependency management. This tool determines the “entry point” of an analysis. To detect the entry point, repro follows simple heuristics, which are informed by what most R users tend to use. These conventions are quite ambiguous, but the most explicit entry point is a Makefile. If no Makefile is available, the alternatives are either a central dynamic document (RMarkdown, Jupyter Notebook) or a primary script (R, Python, Octave, Shell). In these cases, one can only guess from filenames such as manuscript.Rmd, analysis.Rmd, paper.Rmd, run.R, or analysis.R.

To recreate this document, you have to follow the steps below:
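The advice shown below is printed by repro’s rerun() function, which stops short of executing anything itself:

```r
rerun(cache = FALSE)
```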

> ● To reproduce this project, run the following code in a terminal:
>   make docker &&
>   make -B DOCKER=TRUE

The argument cache = FALSE ensures that everything that can be recreated is recreated, even when nothing has changed.

It is challenging to verify whether an analysis was reproduced. As a minimum standard, one could demand that the analysis reruns free of errors or, as a maximal standard, that the resulting documents are exactly the same. Neither extreme strikes the right balance: an error-free rerun does not imply identical results, while the comparison of binary files often leads to spurious differences, e.g., because of numerical instabilities.

Currently, researchers need to resort to manual checking and common sense to verify a successful reproduction. An automated verification procedure would require the researcher to state explicitly which results need to be identical. A software solution could then track changes for only these digital objects and flag mismatches accordingly.

3.3 Change

For a researcher, reproducing an analysis and verifying its results is often only a first step towards making intentional changes. How researchers contribute to a project lies strictly outside the realm of reproducibility, but warrants discussion because easy collaboration is one of the most significant practical advantages of reproducibility. That the primary beneficiary of this advantage is the researcher collaborating with their past self is a quip in the open science community that bears some truth. However, the workflow of an external researcher contributing is more complicated and is hence described here. It is quite a challenge to collaborate under circumstances where people do not work side by side or even know each other. The core challenge is to give the original creator full control over changes without burdening them too much. The open-source software community has confronted this problem from its very beginning and came up with the following solution: a contributor first creates a public copy, makes and tracks changes to it, and then asks the original owner to incorporate the changes. In the terminology of GitHub, the public copy is a “fork”, the tracked changes are “commits”, and the call for including the changes is a “pull request”.

Working with pull requests is easy, thanks to the usethis package. If you reproduced this document, you could make changes to it (something trivial, like correcting spelling) and ask me to incorporate them. You can initialize a pull request with:
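For example, using usethis (the branch name is illustrative):

```r
# Create a new branch dedicated to the intended changes
usethis::pr_init("fix-typos")
```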

Subsequently, you may change the files of this thesis as you like and track the changes with Git. You should then make sure that the analysis is still reproducible with:
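Assuming repro’s rerun() function from the Reproduction section:

```r
rerun()
```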

> ● To reproduce this project, run the following code in a terminal:
>   make docker &&
>   make DOCKER=TRUE

If you are satisfied with the changes you made, you can trigger the pull request with:
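With usethis:

```r
# Push the branch to your fork and open the pull request on GitHub
usethis::pr_push()
```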

If I, as the author of this document, would like to accept them, I can incorporate the changes on GitHub. If not, the changes can be discussed in the pull request, or I can make amendments before merging them.

Such a distributed workflow allows for a much more controlled way of collaborating than mailing files back and forth or using cloud storage systems. This higher level of control matches the high standards of scientific work. Even more important, however, is that this kind of collaboration scales well to many collaborators (Git was originally developed for collaboration on the Linux kernel, to which, as of 2017, more than 15,000 developers had contributed code; The Linux Foundation, 2017). Empirical studies require a lot of work, which is usually distributed across many shoulders. As the authors carry the responsibility for the overall correctness, they ought to vet every single contribution.

Affirming the correctness of a contribution can be partly automated by confirming successful reproduction. Such automatic checks of changes are part of a software development practice called continuous integration. Continuous integration runs code in cloud computing environments that asserts correctness whenever changes are pushed to GitHub. In many ways, continuous integration is the logical next step for reproducible workflows: because much effort has already been invested in ensuring reproducibility across computers, it is easy to move the analysis to a continuous integration service.

Hence, if you create a pull request, the continuous integration service GitHub Actions will rebuild this document, affirming its reproducibility, and let me see the results of your changes.

3.4 Creation

Reproducing an analysis and creating a reproducible analysis are two very different issues. The repro package’s main strength lies in simplifying the creation. First, the repro package comes with a minimal but comprehensive template, including an example RMarkdown document, an R script, and data. This template can be accessed from within RStudio via “File” → “New Project” → “New Directory” → “Example repro template”, or from any R console via:
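A sketch of the console route; repro_template() is the assumed name of the function behind the RStudio project template and may differ between package versions:

```r
# Create the example project template in a new directory
repro::repro_template("my-analysis")
```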

repro infers the dependencies on data and external code, as well as the required packages, from the YAML metadata of the RMarkdown documents. Because analytic data projects have a particular structure, this markup can be much simpler than writing Dockerfiles and Makefiles manually. While Docker allows installing arbitrary software, an analysis in R likely needs nothing but R packages. Similarly, Make can run any software, but an analysis in R only needs to execute R scripts and render RMarkdown documents.

Hence, a minimal addition to the metadata, as can be seen in the following example, contains everything necessary to infer a complete Dockerfile and Makefile:
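A hypothetical RMarkdown YAML header with such a repro block might look as follows; the package, script, and data names are illustrative:

```yaml
---
title: "Example Analysis"
output: html_document
repro:
  packages:
    - usethis
    - fs
  scripts:
    - R/clean.R
  data:
    mycars: data/mtcars.csv
---
```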

From this specification, the function automate() creates a Dockerfile and a Makefile that comply with all recommendations by Peikert & Brandmaier (2019). Strictly speaking, it creates four Dockerfiles and three Makefiles. Most of these files are created in the .repro directory and then assembled into the main Dockerfile/Makefile at the top level. One Dockerfile contains the base Docker image, including the R version and the current date; another contains only the R packages. It also produces one file where the user can add software installations or setup steps that are not covered by repro. The Makefiles are similarly separated, with one file dedicated to RMarkdown documents and another for the logic required to execute the make commands in the container.

The automate() function is designed to simplify the workflow proposed by Peikert & Brandmaier (2019) as much as possible. Such simplification inevitably restricts the user’s freedom. While users can still do everything they want in the realm of Make and Docker, this approach does not allow other reproducibility software to be used. More advanced users who need more control can instead rely on the modular nature of repro. Each component can be added to a project with one of the use_* functions. For example, use_make() adds a basic Makefile, and use_make_singularity() adds a Makefile that is compatible with Singularity (an alternative to Docker for high-performance computing). These functions extend the usethis package (Wickham & Bryan, 2020), which was originally designed to facilitate package development, with specific reproducibility tools.

3.5 Summary

To summarize, if you want to create a reproducible project, you can do so with the following code:
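A minimal sketch, assuming the installation route shown earlier and the template function from the Creation section (repro_template() is the assumed name of that function):

```r
# Install and load repro (installation route from the package README)
if (!requireNamespace("remotes", quietly = TRUE)) install.packages("remotes")
remotes::install_github("aaronpeikert/repro")
library("repro")

# Create a project from the example template; the function name
# repro_template() is assumed here and may differ between versions
repro_template("my-analysis")

# Generate the Makefile and Dockerfile from the YAML metadata
automate()
```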

3.5.2 Check required reproducibility software
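Following the check_* pattern from the Setup section, the relevant checks can be sketched as:

```r
check_git()
check_make()
check_docker()
```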

References

Parasuraman, R., & Mouloua, M. (2018). Automation and Human Performance: Theory and Applications. Routledge.

Peikert, A., & Brandmaier, A. M. (2019). A Reproducible Data Analysis Workflow with R Markdown, Git, Make, and Docker [Preprint]. PsyArXiv. https://doi.org/10.31234/osf.io/8xzqy

The Linux Foundation. (2017, October 25). 2017 Linux Kernel Report Highlights Developers’ Roles and Accelerating Pace of Change. https://www.linuxfoundation.org/blog/2017/10/2017-linux-kernel-report-highlights-developers-roles-accelerating-pace-change/

Wickham, H., & Bryan, J. (2020). Usethis: Automate package and project setup. https://CRAN.R-project.org/package=usethis