Chapter 2 Technical Solutions
This section summarises the workflow proposed by Peikert & Brandmaier (2019; see also The Turing Way Community et al., 2019 for a similar approach). They argue that publicly sharing code and materials is not sufficient to ensure reproducibility. Instead, reproducibility has to rest on five pillars:
- file management: a folder containing all files, referring to each other using relative paths
- literate programming: a central dynamic document that relates code to thought
- version control: a system in place that manages revisions of all files over time
- dependency management: a formal description of how files relate to each other
- containerisation: an exact specification of the computational environment
These pillars stipulate the relations between thought, code and data, as well as their change over time and across environments. They hence meet all requirements of reproducibility by being:
- understandable to other researchers,
- transferable across machines,
- preserved over time.
While comprehensibility to the scientific community is probably the most crucial objective, it is also the most difficult to achieve: as a non-technical requirement, no set of rules can assure its fulfilment (though clear writing and clean code certainly help). Transferability and preservation, on the other hand, are problems with technical solutions.
Peikert & Brandmaier (2019) propose a combination of RMarkdown, Git, Make, and Docker for users of the R programming language (R Core Team, 2020) and provide the high-level overview shown in Figure 2.1.
However, they stress that any combination of tools is suitable as long as it facilitates the above pillars.
Each of the following sections first raises a challenge for reproducibility, then outlines a conceptual remedy along with a concrete tool, and concludes with how both relate to the `repro` package. The relation of these tools to `repro` is then expanded in the next chapter.
2.1 File Organisation
File organisation has to meet two challenges: First, the structure needs to be understandable for others, and second, it needs to be self-contained so that it can be moved to another machine.
Adhering to conventions can help other people understand how files are organised.
For example, the filename `R/reshape.R` both follows standard naming conventions (i.e. all lowercase, ends with `.R`, placed within the `R` directory) and is meaningful. By contrast, `myScripts/munge_Data.r` is probably much harder for most R users both to understand and to remember.
Following two guidelines makes the file structure self-contained:
- Everything is in one folder.
- Every path is relative to that folder.
This simple concept of a self-contained folder is facilitated by two R-specific tools: RStudio Projects and the `here` package (Müller, 2017). The former exempts the user from changing the working directory manually; the latter infers absolute paths from relative ones. Unlike base R's handling of relative paths, this inference is consistent across operating systems, scripts and RMarkdown documents.
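For example, a script can locate the raw data relative to the project root; a minimal sketch, assuming a hypothetical file `data/raw.csv`:

```r
library(here)

# here() resolves the relative path against the project root,
# yielding the same file whether this code runs in a script,
# an RMarkdown document, or on another operating system
raw <- read.csv(here("data", "raw.csv"))
```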
The `repro` package offers a template for an RStudio Project that sets up a file structure following best practices and conventions. This template provides the researcher with a minimal example of a reproducible analysis. It enables researchers to learn from easy-to-understand examples, e.g. how files should refer to each other. Such learning by example complements the abstract arguments for best practices with a more hands-on perspective. The researcher can thus adapt code and files to their needs, either by merely changing them manually or via the modular structure of `repro`. Furthermore, this template provides a reference for researchers that is more concise than any tutorial and connects the concept of file structure with the ideas of the following sections.
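For illustration, a project set up from such a template might be organised as follows (a hypothetical layout, not the template's exact contents):

```
myproject/
├── myproject.Rproj   # RStudio Project file
├── R/                # scripts, e.g. R/clean.R
├── data/             # raw data, never modified by hand
├── analysis.Rmd      # the central dynamic document
├── Makefile          # dependency management (see Section 2.4)
└── Dockerfile        # computational environment (see Section 2.5)
```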
2.2 Dynamic Document Generation
A clear file structure helps researchers to understand better how a scientific result relates to the code, but the otherwise strict segregation of code and document may obscure how both connect, e.g. which parts of the code generate which results. Providing a direct link, dynamic document generation allows interspersing text with code and its results in order to produce a human-readable document. The key feature is that every time such a document is rerun, the results are reproduced dynamically. This functionality eliminates errors due to copying and pasting results from statistical software into a text processor. Such mistakes happen far too often: Nuijten et al. (2016) report that about half of the papers in the psychological sciences contain at least one statistical reporting error that could have been prevented this way.
RMarkdown provides a convenient framework to write such dynamic documents and render them as a wide range of output formats3. In an RMarkdown, three parts can be distinguished:
- one specifying its output and metadata,
- one containing code, and
- one with descriptive text.
Each part uses its own language, all of them designed with ease of use and readability in mind. The section specifying the output format and other metadata is written in YAML (a recursive acronym for “YAML Ain't Markup Language”; see the example below). This specification is located at the top, delimited by three dashes at the beginning and the end of the section. (R-)Code executing an analysis can be placed in a distinct chunk or inline within the text. The former has three backticks on their own line signifying beginning and end; the latter is quoted in a pair of single backticks. Examples of both methods can be found below. Text which is not fenced by either three dashes or backticks is interpreted as literal text written in the markup language “Markdown”. Markdown allows annotating text to signify formatting like bold, italics, and links, and the inclusion of images. This markup is designed to remain readable even as a source file.
The following shows examples of metadata, code and text, specified as described above, forming a minimal example of an RMarkdown document (source code adapted from Xie et al. (2019), CC BY-NC-SA 4.0):
````
---
title: "Hello R Markdown"
author: "Ross Ihaka & Robert Gentleman"
date: "1997-04-23"
output: pdf_document
---

This is a paragraph in an R Markdown document.

Below is a code chunk:

```{r}
fit = lm(dist ~ speed, data = cars)
b = coef(fit)
plot(cars)
abline(fit)
```

The slope of the regression is `r b[2]`.
````
This source renders to the document shown in Figure 2.2.
Undeniably, RMarkdown facilitates reproducibility greatly, but it cannot ensure reproduction.
To virtually guarantee reproduction, the `repro` package extends the YAML metadata of RMarkdown to incorporate dependency management and containerisation into the process of dynamic document creation, so that all dependencies of the analysis are captured. Because many researchers are already familiar with RMarkdown, this solution imposes minimal cognitive overhead. And even though the metadata relates to a single document, `repro` compiles the required information from all documents across the project into suitable Dockerfiles and Makefiles.
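As a sketch of the idea, such extended metadata could look as follows; the exact field names are defined by the `repro` documentation, and the listed package, script and data file are hypothetical:

```yaml
output: html_document
repro:
  packages:
    - tidyverse
  scripts:
    - R/clean.R
  data:
    mycars: data/mycars.csv
```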
2.3 Version Control
Text, code and results of a scientific document are typically modified in cycles of many revisions. As changes accumulate, different versions do so as well, posing a problem for reproducibility: it may be challenging to find out which version of the code relates to the final product. One may argue that in the typical publication process, the final product is apparent: the published paper. However, reproducibility may be crucial even before publication, as part of collaboration and within the peer-review process. Also, recent trends in the publication process such as preprints, open review, registered reports and post-publication review blur the line between published and unpublished.
Organising different versions as changes accumulate across the phases of a project, and across machines and users, is a well-known challenge in software development. It is met with a high degree of automation that keeps track of different versions and provides advanced facilities to compare and merge them.
One such version control software is Git. Git tracks versions of a project folder by taking snapshots of a given state, called commits. Each commit has a unique ID, called a hash, a short description of the changes made, called the commit message, and a link to the previous commit. This linking creates a “pedigree” of versions that makes it easy to see how things have evolved. Going back in time to a specific version only requires knowing the hash of the commit. To mark commits as special milestones, they can be tagged, e.g. as preregistration, preprint, submission or publication.
While mastering Git requires some experience, most of the time, only four commands are needed, all of which can be accessed through RStudio’s Git interface:
- `git add`: take a snapshot of the given file
- `git commit`: create a commit of all added files
- `git push`: upload recent commits to a server
- `git pull`: download and integrate recent commits from the server
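The same four steps can also be performed from within R via the `gert` package introduced below; a minimal sketch, assuming a hypothetical file, commit message and tag:

```r
library(gert)

git_add("analysis.Rmd")               # take a snapshot of the file
git_commit("Add regression analysis") # create a commit of all added files
git_push()                            # upload recent commits to the server
git_pull()                            # download and integrate remote commits
git_tag_create("preprint", "Version submitted as preprint") # mark a milestone
```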
While a few other commands are necessary to set up Git in a given project directory, this work is done by the `repro` package. In my experience, Git is overwhelming to learn because users have to interact with it through a terminal. While R users have some experience with interactive programs that can only be accessed through written commands (e.g. R itself), Git follows, from the perspective of R users, strange and archaic syntactical rules. Thanks to `usethis`, and indirectly to `credentials` (Ooms, 2019a) and `gert` (Ooms, 2019b), users can access most of Git through R commands. Because `repro` mirrors the interface of `usethis`, the cognitive overhead is kept as low as possible.
2.4 Dependency Management
In a scientific analysis of empirical data, the statistical or computational results depend on code, which in turn depends on data. However, the data is rarely analysed as is; usually, some code is dedicated to preparing it (e.g. removing outliers, reshaping, aggregating, correcting artefacts). Most likely, each analysis needs a slightly different version of the data. An analysis of missingness requires the missings to be retained, but some statistical models do not allow that; or the modelling software requires the data to be shaped differently than the plotting library does. Often one analysis is based on the output of another, and so forth. As these relations can become quite complicated, it is necessary to make them explicit to avoid confusion. Dependency management provides a formalism that describes how files depend on other files. More specifically, it provides an automated way to create files from other files, e.g. it automatically generates a cleaned version of the data by relying on a cleaning script and the raw data.
Such relations may be layered: if a plot requires this cleaned dataset, first the cleaned dataset and then the plot is generated automatically. Such a structure saves considerable computing time, because dependencies are not regenerated if they already exist, but only if one of their own dependencies has changed. In this example, upon recreation of the plot, the cleaned dataset is not regenerated as long as the cleaning script and the raw data remain unchanged. This behaviour is most useful when preprocessing requires a lot of computing time, as is typical in neuroimaging or machine learning.
Make is a tool for dependency management. While originally designed for the compilation of programs, it is now increasingly recognised as a tool for ensuring reproducibility. It supports all features mentioned above and more, as it is a full-fledged programming language.
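As a minimal sketch of this formalism, with hypothetical file names, a Makefile stating that a plot depends on the cleaned data, which in turn depends on the raw data, could look like this (note that recipe lines must be indented with a tab):

```make
# the plot is rebuilt only if the cleaned data or the plotting script changed
plot.pdf: data/clean.csv R/plot.R
	Rscript R/plot.R

# the cleaned data is rebuilt only if the raw data or the cleaning script changed
data/clean.csv: data/raw.csv R/clean.R
	Rscript R/clean.R
```

Running `make plot.pdf` then triggers only those steps whose dependencies have changed.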
The `repro` package provides a simplified interface to the essential features of Make, eschewing the need to learn yet another language while leveraging its most important feature, dependency management, for scientific data analyses. Similarly to Git, Make uses an interface that is quite different from the way R is used. Because Make does not have to be used interactively, `repro` reuses the interface of RMarkdown to interact with Make. Users hence only need to learn RMarkdown and gain the ability to use Make (and, as described in the next section, Docker) if they use `repro`.
2.5 Containerisation
Most computer code is not self-contained but depends on libraries and other software to work (e.g. the R programming language or packages). These external dependencies pose a risk for reproducibility because it may not be clear what is necessary, besides the code and data, and how to install it. Even when all required software and their exact versions are recorded meticulously, it may be a challenge to install them on a different system. First, it is difficult to maintain different software versions on the same computer, and second, it may be unclear how to obtain an exact copy of a specific software version. Setting up a computer exactly like someone else's is difficult enough, but replicating another computer's state from several years ago is at best painstaking, if not impossible.
To overcome this challenge, the software environment of a project needs to be separated from the rest of the system. Technically, such separation is called virtualisation, because one software environment is hosted on another. A virtual environment allows each project to have its own software environment without interfering with others. Such a setup is hence ideal for preservation and can easily be recreated on another machine.
Docker allows virtualisation of the whole software stack down to the operating system, but in a much more lightweight way than traditional virtual machines. This lightweight but comprehensive virtualisation is called containerisation. Containers save storage by building on one another, enabling reuse; e.g. two containers with the same R version share the storage for everything but their differing R packages. Containers are created from a plain-text specification called a `Dockerfile`. This file defines which container the result should be based upon and what software should be installed within it.
The `repro` package automatically infers which packages are needed and creates an appropriate Dockerfile and the container from it. This process relies on the same syntax as is employed for Make, so that users do not have to differentiate between file and software dependencies. For example, for this project the `repro`-generated Dockerfile looks like this:
```dockerfile
FROM rocker/verse:3.6.3
ARG BUILD_DATE=2020-07-16
WORKDIR /home/rstudio
RUN MRAN=https://mran.microsoft.com/snapshot/${BUILD_DATE} \
  && echo MRAN=$MRAN >> /etc/environment \
  && export MRAN=$MRAN \
  && echo "options(repos = c(CRAN='$MRAN'), download.file.method = 'libcurl')"\
  >> /usr/local/lib/R/etc/Rprofile.site
RUN install2.r --error --skipinstalled \
  bookdown \
  devtools \
  gert \
  here \
  usethis
RUN installGithub.r \
  aaronpeikert/repro@adb5fa56
```
Furthermore, `repro` provides the infrastructure that connects Make and Docker: it adds an appropriate Make target for the container's image and tunnels the other Make targets through the container. This results in a dedicated Makefile for the Docker-related instructions, which is incorporated into the main Makefile, e.g. like this:
```make
### Docker Options ###
# --user indicates which user to emulate
# -v which directory on the host should be accessible in the container
# the last argument is the name of the container which is the project name
DFLAGS = --rm $(DUSER) -v $(DDIR):$(DHOME) $(PROJECT)
DCMD = run
DHOME = /home/rstudio/

# docker for windows needs a pretty unusual path specification
# which needs to be specified *manually*
ifeq ($(WINDOWS),TRUE)
ifndef WINDIR
$(error WINDIR is not set)
endif
DDIR := $(WINDIR)
# is meant to be empty
UID :=
DUSER :=
else
DDIR := $(CURDIR)
UID = $(shell id -u)
DUSER = --user $(UID)
endif

### Docker Command ###
ifeq ($(DOCKER),TRUE)
DRUN := docker $(DCMD) $(DFLAGS)
WORKDIR := $(DHOME)
endif

### Docker Image ###
docker: build

build: Dockerfile
	docker build -t $(PROJECT) .

rebuild:
	docker build --no-cache -t $(PROJECT) .

save-docker: $(PROJECT).tar.gz

$(PROJECT).tar.gz:
	docker save $(PROJECT):latest | gzip > $@
```
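With this setup, `make docker` builds the container image from the Dockerfile, and invoking Make with `DOCKER=TRUE` routes the other targets through the container (assuming, as sketched above, that the main Makefile prefixes its recipes with `$(DRUN)`).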
References
Müller, K. (2017). Here: A simpler way to find your files. https://CRAN.R-project.org/package=here
Nuijten, M. B., Hartgerink, C. H. J., van Assen, M. A. L. M., Epskamp, S., & Wicherts, J. M. (2016). The prevalence of statistical reporting errors in psychology (1985–2013). Behavior Research Methods, 48(4), 1205–1226. https://doi.org/10.3758/s13428-015-0664-2
Ooms, J. (2019a). Credentials: Tools for managing SSH and Git credentials. https://CRAN.R-project.org/package=credentials
Ooms, J. (2019b). Gert: Simple Git client for R. https://CRAN.R-project.org/package=gert
Peikert, A., & Brandmaier, A. M. (2019). A Reproducible Data Analysis Workflow with R Markdown, Git, Make, and Docker [Preprint]. PsyArXiv. https://doi.org/10.31234/osf.io/8xzqy
R Core Team. (2020). R: A language and environment for statistical computing. R Foundation for Statistical Computing. https://www.R-project.org/
The Turing Way Community, Arnold, B., Bowler, L., Gibson, S., Herterich, P., Higman, R., Krystalli, A., Morley, A., O’Reilly, M., & Whitaker, K. (2019). The Turing Way: A Handbook for Reproducible Data Science. https://doi.org/10.5281/zenodo.3233986
Xie, Y., Allaire, J. J., & Grolemund, G. (2019). R Markdown: The definitive guide. Chapman and Hall/CRC.