Workflow Orchestration
ReproDude
Hey, I’m your ReproDude for this chapter. If you have any questions click on me and we can talk!
What now?
We have already reached the last chapter of content! What is it about?
As usual, let’s look at our components. Which is the last missing piece for fully automating reproducibility?
code + data + text + history + software + workflow
Workflow, I think I heard of that. Let’s peek again at our problem and software solution lists to be sure.
Problem list:
- copy&paste mistakes
- inconsistent versions of code or data
- missing or incompatible software
- complicated or ambiguous procedure for reproduction
Software solution list:
- RMarkdown
- Git
- Docker
- Make
So that means we use Make to avoid the problem of a complicated or ambiguous procedure for reproduction? Let’s jump right in!
Make?
To understand what Make is, let’s take a step back. To do that, let’s move on to a topic that I’m really passionate about. Food. And even one step further back, cooking. What is cooking?
Beyond its dictionary definition, cooking is a sequence of steps that depend on one another. Let’s take a look at my favorite recipe.
Arrabbiata sauce, or sugo all’arrabbiata in Italian, is a spicy sauce for pasta made from garlic, tomatoes, and dried red chili peppers cooked in olive oil.
“Arrabbiata sauce” from Wikipedia under CC BY-SA 3.0
And what are the steps in this masterpiece of craftsmanship?
Arrabbiata sauce, or sugo all’arrabbiata in Italian, is a spicy sauce for pasta made from garlic, tomatoes, and dried red chili peppers cooked in olive oil.
“Arrabbiata sauce” from Wikipedia under CC BY-SA 3.0, emphasis added
Despite the beauty inherent in this recipe, we also want to formalize it. Let’s see how this recipe looks in a formalized version:
arrabiata.pdf: arrabiata.Rmd sauce.csv R/pasta.R
	Rscript -e "rmarkdown::render('arrabiata.Rmd')"
sauce.csv: R/cook.R tomatoes.zip aromatics.yaml
	Rscript -e "source('R/cook.R')"
aromatics.yaml: R/sizzle.R garlic.txt chili.json
	Rscript -e "source('R/sizzle.R')"
Let’s break it down. arrabiata.pdf is the sauce we want to end up with. It is created from arrabiata.Rmd, our recipe, which uses sauce.csv and R/pasta.R.
sauce.csv is created with R/cook.R with the raw materials tomatoes.zip and aromatics.yaml.
aromatics.yaml in turn is created by R/sizzle.R with the raw materials garlic.txt and chili.json.
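All three rules share one shape, which can be sketched as follows. This is only an illustrative rewrite of the last rule above: using the automatic variable `$<` (GNU Make's name for the first prerequisite) and calling `Rscript` on the file directly are stylistic assumptions, not something the recipe requires.

```make
# Shape of every rule:
#   target: prerequisite1 prerequisite2 ...
#   <TAB>command that creates the target
#
# Make rebuilds a target when it is missing or older than any prerequisite.
# The command line must start with a tab character, not spaces.

# The aromatics.yaml rule again. Because R/sizzle.R is listed first,
# the recipe can refer to it as $< (the first prerequisite).
aromatics.yaml: R/sizzle.R garlic.txt chili.json
	Rscript $<
```

Writing the script name only once keeps the prerequisite list and the command from drifting apart when a file is renamed.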
If you have trouble understanding the recipe, don’t worry, we’re all just virtual people. Maybe a conversation with ReproDude will help you?
After looking at the arrabbiata sauce, here is another example of a culinary masterpiece.
Brownies.
Just kidding, let’s get out of here or I’ll need a lunch break.
Instead, let’s point out the key features of this kind of recipe by Make:
- Missing ingredients will be generated, e.g., if the cleaned data is missing, the raw data is first cleaned.
- Newer ingredients trigger updates, e.g., new data leads to the recreation of the whole manuscript.
- Always the same “button” that triggers reproduction, e.g., regardless of programming language, file format, and intermediate steps.
And the great thing is that repro::automate() generates these recipes for Rmd files automatically; only more deeply nested dependencies must be added manually.
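A manually added deeper dependency might look like the sketch below. The file names `data/clean.csv`, `data/raw.csv`, and `R/clean.R` are hypothetical, and the sketch assumes that a rule with a recipe for `inflation.pdf` already exists elsewhere in the Makefile. Make merges the prerequisites when a target appears in more than one rule, as long as only one of those rules has a recipe.

```make
# Hypothetical: inflation.Rmd reads data/clean.csv, so the PDF depends on it.
# There is no recipe here; this line only adds a prerequisite to the
# existing rule for inflation.pdf.
inflation.pdf: data/clean.csv

# Hypothetical: the cleaned data is produced from the raw data by a script.
data/clean.csv: R/clean.R data/raw.csv
	Rscript R/clean.R
```

With these two rules, changing `data/raw.csv` would cause Make to rerun the cleaning script first and then rebuild the PDF.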
Hands on!
But now we have had enough thinking, let’s go back to the code!
Go to the terminal (Alt + Shift + M) and type make inflation.pdf
[Hint: If you did not change output: html_document to output: pdf_document in inflation.Rmd, you need to use .html instead of .pdf.]
Delete inflation.pdf.
Add inflation.pdf to the target all (e.g., all: inflation.pdf) in the Makefile.
Go to the terminal (Alt + Shift + M) and type make
Run: make -B --dry-run
[Hint: -B means rebuild everything. --dry-run means do not actually run the commands.]
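After these steps, the relevant part of the Makefile might look roughly like this. It is a sketch under assumptions: the real Makefile in your project contains more than these lines, and the `.PHONY` declaration is a common-practice addition rather than something the steps above ask for.

```make
# Plain `make` builds the first target that does not start with a period,
# so placing 'all' at the top makes it the default.
# .PHONY marks 'all' as a label rather than a file to be created.
.PHONY: all
all: inflation.pdf
```

Any further document you want built by a plain `make` can be appended to the prerequisites of `all`.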
And now that we’ve created the recipe, let’s take it one step further and Make it in the cloud.
Run repro::use_make_publish()
Paste the following into the Makefile:
publish/: inflation.pdf
include .repro/Makefile_publish
Back to the Console, run repro::use_gha_publish()
Commit and push.
Inspect .repro/Makefile_publish.
Ready yet?
All right, what did we just do?
We created a Makefile and sent it to the cloud using Make and the repro package. This Makefile ensures that we can track how our sauce is created!
And now?
Congratulations, we are more or less near the finish line!
Before we finish, let’s take one more quick look together at what we just did. We now have all the components from our toolbox together.
code + data + text + history + software + workflow
And with that, we solved our last problem on the list:
- copy&paste mistakes
- inconsistent versions of code or data
- missing or incompatible software
- complicated or ambiguous procedure for reproduction
And which software did we use?
- RMarkdown
- Git
- Docker
- Make
Almost the Final Final Step!
Now please go through all the things we just did and all the software we used.
You are currently at your computer using Posit Cloud, which hosts an R environment where you used Make and the repro package to generate a Makefile that automates rebuilding your document from its sources.
You have now reached the advanced status of reproducibility; there is only one last step to show the world what we have learned here.
Publish
With this step we will publish the document automatically. For this we will use GitHub again.
Go back to GitHub.
Go to Settings → Pages and change none to gh-pages.
Go to Actions and wait for the new action to finish.
Inspect the published PDF online (change username):
yourusername.github.io/project/inflation.pdf
Change something, e.g., make the plot ugly again, then commit and push. The rebuild takes about 5 minutes.
The great thing about this automation is that it now automatically builds and publishes your document every time you push a new version to GitHub.
And with that, we have concluded the workshop in terms of content! In the following chapter, we’ll just go through some suggestions on how you can continue your reproducibility journey.
You are now ready for the last chapter.