start2finish is a project I created to share resources and provide a simple starting place for building a reproducible project. When I created this, I had my mentors and peers in mind, hoping that this guide would take the fear out of creating an R package for their own projects.
R from start to finish: Organizing your dissertation work with a reproducibility mindset using R and RStudio
I presented this poster at the Ecological Society of America's Annual meeting: ESA 2020 abstract and ESA poster
Have you ever run old code only to have it break? Or found the code associated with a scientific article and, when trying to understand or run it, realized it was almost impossible to figure out?
Like many ecologists, I started my journey with R and data analysis as a self-directed adventure, first learning to code and only later realizing the importance of reproducibility. For R programming in particular, there is an overwhelming number of resources for reproducible research. This poster is meant to be a resource, a short guide and starting point for setting up a reproducible workflow in R.
Most of the workflow relies on the usethis package, and you can find a short tutorial on building packages here.
Perhaps this depends on the type of work you do; I'm not an expert and don't have particularly strong opinions. However, a package makes you follow certain conventions to keep things organized. I am a fan of writing functions in an R package and writing detailed documentation using the 'roxygen2' package. Building a package is a nice way to keep things together, organized, and clear. That said, I am sure you can create a chaotic package as well.
You don’t need Git and GitHub to set up a project in R, but maintaining version control is highly recommended and fundamental for reproducibility. It means that there is a history for your code and analysis. Connecting Git and GitHub to RStudio is system dependent; a good resource for this process can be found at happygitwithr.com.
Before you run these steps, make sure you have installed the following packages: usethis, roxygen2, renv, here.
- Create the package, add a license if you want to, and if feeling adventurous you can create a GitHub repo for it. The reference functions for the usethis package can be found here. Running this function will open a new R session with your package!
usethis::create_package("your package path")
- Keep track of the packages that you use with the renv package.
renv::init()
renv::snapshot()
- Use the here package and avoid starting scripts with setwd("your/specific/path/that/does/not/work/on/another/computer"). I will be honest, I had a hard time understanding this package until I ran across Jenny Richmond's post on how to use the here package. It comes down to the difference in file paths between .R and .Rmd files.
here::here()
- Use dplyr or base R to clean your data using R scripts. Any changes or deletions that happen in the spreadsheet are lost and forgotten in the realm of non-reproducible clicks. Clean your data with scripts so that you can always go back to the original and be certain of what changes have been made during the cleanup. Broman & Woo (2018) has great advice on working with spreadsheets.
- Write your analysis and even your manuscript in rmarkdown. There are several packages out there that use rmarkdown and will help set up different types of articles. You can even create presentations with rmarkdown. With rmarkdown and version control (Git and GitHub), you can avoid having several final.docx versions of your work.
When I am starting a new project, I follow these steps:
usethis::create_package("projects/mypackage")
usethis::use_mit_license(name = "Your Name")
usethis::use_git()
usethis::use_github()
usethis::use_readme_rmd()
These steps will create my package, my GitHub repo and a README with rmarkdown so that I can include chunks of code and figures with it. After that setup I will start tracking my packages:
renv::init()
renv::snapshot()
I will add some of the packages I know I will use in my work to the package's dependencies, one call per package:
for (pkg in c("dplyr", "ggplot2", "fitdistrplus")) usethis::use_package(pkg)
And then save the changes with
renv::snapshot()
After the snapshot, you can commit your changes and push them to your repo so that your lockfile (renv.lock) is updated. Any time you add new packages, repeat these steps.
You can then create your first script and add a function to it, documenting the function with roxygen2. You can find a short tutorial here.
usethis::use_r("name of your script")
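As a rough idea of what such a script could contain, here is a small sketch of a function documented with roxygen2 comments. The function name and its purpose are hypothetical, chosen only to show the comment layout; the tags used (@param, @return, @examples, @export) are standard roxygen2 tags.

```r
#' Convert temperature from Fahrenheit to Celsius
#'
#' Hypothetical example function, used only to illustrate roxygen2 comments.
#'
#' @param temp_f Numeric vector of temperatures in degrees Fahrenheit.
#' @return Numeric vector of temperatures in degrees Celsius.
#' @examples
#' fahrenheit_to_celsius(c(32, 212))
#' @export
fahrenheit_to_celsius <- function(temp_f) {
  # Fail early with a clear error if the input is not numeric.
  stopifnot(is.numeric(temp_f))
  (temp_f - 32) * 5 / 9
}
```

With the comments in place, running roxygen2::roxygenise() (or devtools::document(), if you have devtools installed) from the package directory should generate the help file for the function.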
This setup is intended to help you take the leap and get started. There are a number of resources out there, sometimes perhaps too many. If you'd like to jump over to "how do I write my manuscript in rmarkdown", you should definitely check out Anna Krystalli's Reproduce a paper in Rmd and follow some of the resources below.
- Anna Krystalli (@annakrystalli) and her talk “Putting the R into Reproducible Research”
- Sharla Gelfand (@sharlagelfand) and her talk at rstudio::conf(2020)
- Karthik Ram (@_inundata), his GitHub repo and talk at rstudio::conf(2019)
- ‘thesisdown’ repo from Chester Ismay (@old_man_chester) – this one is specific for dissertation writing
- ‘rrtools’ project and Ben Marwick (@benmarwick), who also has several publications on this topic.
- Reproducibility in science – guide, from rOpenSci
- Open Science Framework (@OSFramework)
- Broman, K. W., & Woo, K. H. (2018). Data organization in spreadsheets. The American Statistician, 72(1), 2-10.
- Jenny Bryan's STAT545
- Don’t forget your research services or reproducibility librarian!
- Boettiger, C. (2018), From noise to knowledge: how randomness generates novel phenomena and reveals information. Ecol Lett, 21: 1255-1267. doi:10.1111/ele.13085