Dynamic Documents
ReproDude
Hey, I’m your ReproDude for this chapter. If you have any questions click on me and we can talk!
What do you see before you?
First of all:
Look around, become comfortable with Posit/RStudio Cloud. Then, locate the files pane.
You are currently at your computer using Posit Cloud, which hosts the R environment that you are looking at. Importantly, anything you do locally on your computer (e.g., with your locally installed R, Git, terminal, etc.) will have no effect on the cloud environment.
Do you feel comfortable yet? Yes? Then let’s move on. If not, take as much time as you need and move on as soon as you want. I have no more appointments today.
Best practices?
To give you a sense of how reproducibility can be increased, let us look at some best practices:
Open the R folder.
You should now see three different R files.
Click on the file R/prepare_games.R
Now try to understand the code.
Which of these do you already do? What could you improve?
You want to see more? Just for you (same thing but more complicated):
Take a look at R/prepare_inflation.R. Remember: blue means a task is optional.
A summary?
Let us summarize this in an incomplete list:
- List requirements early
  - What has to be installed?
  - What datasets have to be acquired, and from where?
  - What computational resources must be present?
- Use relative locations
  - relative paths: ./data/ instead of /documents/aaron/data/
  - names instead of indices: data[["id"]] instead of data[[1]]
- Document relevant information
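The first two points can be sketched in a few lines of R. This is only an illustration, not part of the workshop files: the package names and the file path are borrowed from this chapter's example, and the small data frame is made up for the demonstration.

```r
# 1. List requirements early: check up front which packages are missing.
#    (Package names borrowed from this chapter's example; adapt as needed.)
required <- c("here", "tidyverse", "lubridate")
installed <- vapply(required, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- required[!installed]
if (length(missing_pkgs) > 0) {
  message("Please install: ", paste(missing_pkgs, collapse = ", "))
}

# 2. Use relative paths: built relative to the project folder,
#    so the path works on any computer that has the project.
raw_file <- file.path("data", "raw", "inflation.rds")

# 3. Use names instead of indices (toy data frame, made up for this sketch).
data <- data.frame(id = c(1, 2, 3), score = c(10, 20, 30))
reordered <- data[, c("score", "id")]  # someone changes the column order

by_name <- reordered[["id"]]   # still the id column
by_index <- reordered[[1]]     # silently became the score column
```

Access by name keeps returning the same column after the reordering, while access by index quietly returns something else.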
The first two points are more or less clear. If you would like to talk about them, maybe you want to give ReproDude a try?
But let’s talk about the last point. Does that mean that every piece of documentation has to be a mile long?
No… not necessarily.
But how do I decide what should or should not be included in my documentation? Let’s make another list:
- What is standard does not have to be documented.
  - If you use, e.g., a CSV file, you do not need to explain that values are separated by commas.
- What is easy needs only little documentation.
  - If your code is easy to understand, you don’t have to write a comment explaining it.
- What is consistent only has to be documented once.
  - Sometimes things are complex and there is no way around it, e.g., you have a messy JSON dataset that must be cleaned. However, if you write a function for the cleaning and reuse it, there is no need to repeat yourself in explaining how it works.
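As a minimal sketch of "documented once", consider a small cleaning function. The function name clean_record, its rules, and the example records are hypothetical and only stand in for whatever your messy dataset actually needs:

```r
#' Clean one messy record (hypothetical example).
#'
#' The cleaning rules are explained here, once:
#' - surrounding whitespace is trimmed from the name,
#' - an empty name becomes NA,
#' - age is converted to an integer; anything non-numeric becomes NA.
clean_record <- function(record) {
  name <- trimws(record[["name"]])
  if (!nzchar(name)) name <- NA_character_
  age <- suppressWarnings(as.integer(record[["age"]]))
  list(name = name, age = age)
}

# Reusing the function needs no further explanation at the call site.
records <- list(
  list(name = "  Ada ", age = "36"),
  list(name = "", age = "unknown")
)
cleaned <- lapply(records, clean_record)
```

Every later call to the function relies on the same single piece of documentation.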
If you follow these best practices:
- List requirements early
- Use relative locations
- Document relevant information
  - What is standard does not have to be documented.
  - What is easy needs only little documentation.
  - What is consistent only has to be documented once.
you are already in a pretty good position. However, we could spend a whole day on how to write code that is easy to follow and on all the best practices around that. Instead, today we want to focus on how to automate reproducibility, assuming you already have an R script that is easy to understand. We do not have the time for everything else today, and there are simply too many different skill levels present in a workshop like this.
Extended Goals?
The goal today is conceptually simple:
raw data → automatically → final paper
We want to create a system that takes your analysis and reproduces it automagically. Whenever you make a change to your analysis, you get the results without lifting a finger. Crucially, since the analysis is reproduced by an external system, this proves to you that your results can be reproduced by any external person. In fact, when I say results, what I really mean is the whole article, including text, tables, and figures. An added benefit is that collaborators, readers, and the future you can simply change things online without installing anything and can expect to get the full paper reproduced with these changes.
The goal, therefore, is to create a neat bundle of the following components that we can send to others so they have everything they need for reproduction:
code + data + text + history + software + workflow
But haven’t we already looked at parts of it? Yes!
code + data + text + history + software + workflow
We have already dealt with code and data (or rather I assume that is solved).
This is a schematic overview of the system we want to build today:
We build this system to solve the following problems:
- copy&paste mistakes
- inconsistent versions of code or data
- missing or incompatible software
- complicated or ambiguous procedure for reproduction
Using:
- RMarkdown
- Git
- Docker
- Make
That is, the text component suffers from copy&paste mistakes, and we use RMarkdown to solve that problem.
RMarkdown
I will give you more information about RMarkdown by taking you to their documentation.
Now let’s go back to your R cloud environment.
Open the file inflation.Rmd.
If you are still in the R folder, you have to go up one level again.
Take a minute (or even two) to skim the document.
Click on Knit.
And if you want to admire more examples, click on RMarkdown Gallery.
Metadata (YAML)
Now that you have looked at one or more RMarkdown examples, did you notice the following part?
---
title: "Inflation Data"
author: "Aaron Peikert"
date: "2024-11-15"
output: html_document
repro:
  packages:
    - here
    - tidyverse
    - lubridate
    - aaronpeikert/repro@fc7e884
  scripts:
    - R/prepare_inflation.R
  data:
    - data/raw/inflation.rds
---
This is the metadata, in YAML format, for the RMarkdown document.
Change html_document to pdf_document
Knit again.
What happened?
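Knitting can also be triggered from code, which is roughly what the Knit button does for you. The sketch below assumes the rmarkdown package and pandoc are available (as they are in RStudio); the tiny demo document is made up for this illustration and is not one of the workshop files.

```r
# Write a tiny, made-up R Markdown document to a temporary file.
rmd_file <- tempfile(fileext = ".Rmd")
writeLines(c(
  "---",
  'title: "Demo"',
  "output: html_document",
  "---",
  "",
  "Two plus two is `r 2 + 2`."
), rmd_file)

# Render only if rmarkdown and pandoc are actually available.
can_render <- requireNamespace("rmarkdown", quietly = TRUE) &&
  rmarkdown::pandoc_available()

if (can_render) {
  # output_format takes precedence over the `output:` field in the YAML header.
  out_file <- rmarkdown::render(
    rmd_file,
    output_format = "html_document",
    quiet = TRUE
  )
  print(out_file)  # path of the rendered file
}
```

Passing "pdf_document" as the output format works the same way, but additionally requires a LaTeX installation.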
You want to experiment more?
Change the author or date field. Try the Tufte format.
Text (Markdown)
In the document, you probably also noticed this part:
The dataset we use stems from the [Bank of England Research datasets](https://www.bankofengland.co.uk/statistics/research-datasets).
I quote:
> This dataset contains the individual responses to our Inflation Attitudes Survey, a quarterly survey of people’s feelings about inflation and other economic variables like the interest rate.
This is Markdown and we will use it to write and format the actual text.
Make something bold and something else italic:
This is **bold**, while this is *italic*.
Go to Help → Markdown Quick Reference and try something out.
Code (R)
But there is yet another component in the document. This part contains code blocks like:
```{r}
inflation %>%
  group_by(date) %>%
  summarise(across(c(perception, expectation),
                   ~ mean(., na.rm = TRUE)),
            .groups = "drop") %>%
  pivot_longer(c(expectation, perception)) %>%
  ungroup() %>%
  ggplot() +
  geom_line(aes(date, value, color = name)) +
  NULL
```
These code blocks can contain not only R code but also Julia, Python, Octave, and other programming languages.
Add an R code chunk (Ctrl + Alt + I) and inline code:
A code chunk is for longer code/output:
```{r}
with(mtcars, plot(hp, mpg))
```
Inline code is for single numbers/short text:
We have `r nrow(mtcars)` cars.
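Inline code is evaluated while knitting, and its result is pasted into the surrounding sentence, so the number never has to be copied by hand. As a rough illustration in plain R, using the built-in mtcars dataset from the chunk above (the variable names are only illustrative), the same sentence can be built like this:

```r
# Compute the number from the data instead of typing it by hand.
n_cars <- nrow(mtcars)

# Build the sentence that inline code would produce in the knitted text.
sentence <- sprintf("We have %d cars.", n_cars)
print(sentence)
```

If the dataset changes, the sentence changes with it, which is exactly how RMarkdown avoids copy&paste mistakes.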
Include all the code in output with: knitr::opts_chunk$set(echo = TRUE)
Try using python:
```{python}
print("Hello World! Python here, do you miss R already?")
```
And now?
Congratulations, another section done!
Before we continue, let’s take a quick look together at what we have just done. We now have one more component in our toolbox.
code + data + text + history + software + workflow
And with that we solved the first problem on our list:
- copy&paste mistakes (solved)
- inconsistent versions of code or data
- missing or incompatible software
- complicated or ambiguous procedure for reproduction
And which software did we just use for this?
- RMarkdown
- Git
- Docker
- Make
Final Step
Now please go through what we have just done and all the software we use.
You are currently at your computer using Posit Cloud which hosts an R environment where you used RMarkdown to write some examples.
Shall we both take a short break or do you want to continue straight away?
You are ready for the next chapter.