Background Info and Resources
R is a many-faceted language. Most people "use R" - to analyze data, create graphics, and so on. The objectives when "using R" are something like the following:
- do what you need to do quickly
- write code that works on your computer
- create scripts that you know how to run
- comment and format code so you know what it does
There are many resources for learning how to use R. Some free examples are the official Introduction to R, and the book R for Data Science, by Garrett Grolemund and Hadley Wickham.
However, this page lists some resources for learning how to develop in R, which is something of a different animal. We want to:
- do what you need to do well
- write portable code that works on everyone's/most people's computers
- create functions and objects that other people will know how to use
- comment and format code so that many people can understand it and collaborate with you
A great resource for learning how to develop in R is Hadley Wickham's book Advanced R - I have some specific links in the following sections.
There are some fundamental building blocks of R development that are important to understand.
Particularly important topics are:
- Functions. The building block of R packages.
- Object oriented programming in R. dampack uses basic functionality of the S3 system of OO programming.
- Environments. This isn't as essential as the other 3 topics, but can be useful to understand how objects are stored and accessed within functions - see this section
- Style. dampack uses the package lintr to enforce a version of this style guide.
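To make the first three topics a bit more concrete, here is a minimal sketch of an S3 class with a print method, followed by a closure that shows how a function keeps access to the environment it was created in. The class name cea_result and the helpers new_cea_result() and make_counter() are made up for illustration and are not part of dampack.

```r
# Constructor for a hypothetical S3 class "cea_result" (not part of dampack).
new_cea_result <- function(strategy, cost, effect) {
  stopifnot(is.character(strategy), is.numeric(cost), is.numeric(effect))
  structure(
    list(strategy = strategy, cost = cost, effect = effect),
    class = "cea_result"
  )
}

# S3 method: print() dispatches here for objects of class "cea_result".
print.cea_result <- function(x, ...) {
  cat("Strategy:", x$strategy, "\n")
  cat("Cost per unit of effect:", x$cost / x$effect, "\n")
  invisible(x)
}

res <- new_cea_result("screening", cost = 1200, effect = 4)
print(res)

# Environments: the inner function remembers the environment in which it
# was defined, so `count` persists between calls (a closure).
make_counter <- function() {
  count <- 0
  function() {
    count <<- count + 1
    count
  }
}

counter <- make_counter()
counter()
counter()
```

Typing res at the console would call print.cea_result() automatically, which is the kind of basic S3 dispatch referred to above.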
For using R, it's fine in many cases to just have one script, an R Markdown document, or a few scripts that are linked together in some way. However, in many cases, that's a bad way to share your code with other people for them to use and build on. Most R developers release their code in packages, which are collections of code that you are surely familiar with if you've used R - when you write library(ggplot2) or library(MASS), you're loading those packages to make their contents available.
There are many great things about packages:
- if you develop the package well, anyone with a macOS, Windows, or Linux computer should be able to run it
- the package installation machinery automatically installs any other packages your package uses
- documenting the functions and objects within the package is pretty easy - which makes it a lot easier for your users!
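As a rough illustration of the documentation point, here is what a documented function could look like, assuming the package uses roxygen2-style comments (the approach described in the R packages book). The function icer_ratio() is a hypothetical example, not a dampack function.

```r
#' Ratio of incremental cost to incremental effect
#'
#' Hypothetical example function, for illustration only.
#'
#' @param delta_cost Numeric. Difference in cost between two strategies.
#' @param delta_effect Numeric. Difference in effect between the same
#'   two strategies. Must be non-zero.
#' @return A numeric vector, `delta_cost / delta_effect`.
#' @examples
#' icer_ratio(1000, 2)
#' @export
icer_ratio <- function(delta_cost, delta_effect) {
  if (any(delta_effect == 0)) {
    stop("delta_effect must be non-zero")
  }
  delta_cost / delta_effect
}

icer_ratio(1000, 2)
```

The #' lines are ordinary comments as far as R is concerned. roxygen2 turns them into the help page that users see with ?icer_ratio.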
Hadley Wickham's book R packages is an excellent introduction to nearly all of the topics you'll need to contribute to dampack. I'd recommend reading through it.
There are several ways that R packages are distributed. The most official is the Comprehensive R Archive Network, or CRAN. When a user types install.packages('ggplot2'), R goes to CRAN and looks for ggplot2. The easy access is why we'll be submitting dampack to CRAN.
However, there are some differences between distributing on CRAN and distributing via GitHub, for example. When you first submit your package, CRAN runs many tests on the package to make sure that it can be installed and contains all the necessary package files. If the package fails even one test, CRAN won't accept it! The "R packages" book has a good chapter on running the CRAN check on your computer.
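One way to run the check locally is through the devtools package. The sketch below assumes devtools is installed. The wrapper run_local_check() is a hypothetical helper, added only so that the example does nothing harmful when it is run outside a package directory.

```r
# Hypothetical helper, not part of dampack: run the CRAN-style checks
# locally if devtools is available and pkg_dir looks like a package.
run_local_check <- function(pkg_dir = ".") {
  if (!requireNamespace("devtools", quietly = TRUE)) {
    message("devtools is not installed; skipping check")
    return(invisible(NULL))
  }
  if (!file.exists(file.path(pkg_dir, "DESCRIPTION"))) {
    message("no DESCRIPTION file in ", pkg_dir, "; not a package directory")
    return(invisible(NULL))
  }
  # cran = TRUE requests CRAN-like check settings
  devtools::check(pkg = pkg_dir, cran = TRUE)
}
```

Calling run_local_check() from the root of the package reports errors, warnings, and notes. All of these are worth resolving before a CRAN submission.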
Software development is usually collaborative, which means that others need to be able to run your code, understand your code, and contribute. There are many aspects to this, including good communication, but I'll talk about the technical aspects first.
There are a few things that are important to consider within R itself: unit testing, code style, and comments.
Unit testing is an important part of software development. There's a slightly abstract Wikipedia article which talks about the general philosophy.
More practically, when writing a function/class/object in R, you want to be sure that it does what it's supposed to do. In addition, you want to be sure that your new changes don't break everything you've spent 6 months on! The solution to this is unit testing, in which you write a test alongside each new function/class/object to make sure that it works.
Hadley, once again, has a great introduction to testing: http://r-pkgs.had.co.nz/tests.html
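A minimal sketch of what such a test can look like, using the testthat package that the linked chapter describes. The function discount() is a made-up example rather than dampack code, and the requireNamespace() guard is there only so the snippet also runs on a machine without testthat.

```r
# Function under test (hypothetical, for illustration only).
discount <- function(value, rate, year) {
  value / (1 + rate)^year
}

if (requireNamespace("testthat", quietly = TRUE)) {
  library(testthat)

  test_that("discount() leaves year-0 values unchanged", {
    expect_equal(discount(100, 0.03, 0), 100)
  })

  test_that("discount() shrinks future values", {
    expect_lt(discount(100, 0.03, 10), 100)
    expect_equal(discount(110, 0.10, 1), 100)
  })
}
```

In a package, tests like these usually live in files under tests/testthat/, so that they run as part of the package check.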
Code style is just the way that code is formatted. Why care about being consistent with this?
- without a consistent style, code that different people wrote can look very different - this can make the code base hard to understand
- it's clear what different parts of the code do. As an example, if periods are only used in method names (print.my_class()), then we know that anything with a period is a method.
- easier reading
In dampack, I've used the lintr package to require adherence to a version of this style guide. This has the potential to be annoying, but I think it will make for more interpretable code.
Git is what's known as a "version control" system, which is just what it sounds like: a way to control and keep track of versions of your code. This is essential for software development. Consider some scenarios:
- I have a brilliant idea! I make big changes to my code to match this brilliant idea. Except...it turns out that it wasn't such a brilliant idea after all, and I broke everything. With Git, I can easily revert back to the earlier version of the code.
- Steph (who lives in Iceland) is working on part A of dampack, and Tamika (who lives in Des Moines) is working on part B. However, they both have to change part C along the way. In addition, part C is used in parts D, E, and F. Yikes! How do we reconcile these two versions, while not breaking everything else? In Git, this is known as a "merge", and can be done relatively painlessly. Combined with GitHub, continuous integration, and unit testing, we can even be sure that almost the whole package still works after the changes.
The Git team has a comprehensive introduction to Git called Pro Git. I'd recommend the first 6 chapters to understand the lingo and the underlying theory.
For a more practical introduction, GitHub has a nice Git Handbook.
So if that's Git, what's GitHub? According to the GitHub Git Handbook,
GitHub is a Git hosting repository that provides developers with tools to ship better code through command line features, issues (threaded discussions), pull requests, code review, or the use of a collection of free and for-purchase apps in the GitHub Marketplace.
Basically, GitHub:
- hosts Git repositories on the internet
- allows for discussion and easy collaboration
- is very popular
There are others as well (like Bitbucket), but GitHub is popular and easy to work with.
They have a number of online guides. I'd recommend starting with Hello World and then moving to the GitHub flow - but feel free to check them all out!
In this project, we basically try to follow the "GitHub flow", which means that, to add a new feature or fix a bug, you:
- create a new branch off master
- push your changes to that branch
- once you're ready, open a pull request and make sure all the tests pass (see below)
- merge your branch to master (or ask someone else to do it, if you don't have sufficient privileges)
In step (3) of the GitHub flow, I mentioned "tests". How are these tests run on GitHub, and what kind of tests are they?
To answer the first question: GitHub supports several "continuous integration" services, which can do a lot of things. In the dampack project, we just use them for one thing: to automatically run the unit tests and the CRAN checks for the dampack package.
To motivate this, consider Steph and Tamika again. Let's say that Steph has been working on the master branch of dampack, and Tamika has a great idea. So, Tamika goes through the GitHub flow: she creates a new branch (great-idea), pushes her changes, and opens a pull request. Now, how does Steph know that Tamika's idea doesn't break the rest of dampack?
First, if Steph knows every single piece of code that the change could potentially affect, she could determine if the new feature would break anything. But that is unrealistic, especially for larger projects.
Second, Steph could fetch the great-idea branch from GitHub, re-build the package, run all the tests, and request changes if things don't work. But this takes a lot of time! Imagine if Steph were reviewing 4-5 pull requests a day. That would be a lot of work.
Finally, what if the repository needs to be tested on Windows, macOS, and Linux? Coordinating all of that would be fairly difficult.
Continuous integration services, like Travis CI, automate this process. Travis CI is used for dampack, and is well-supported on GitHub. In dampack, with every push to every branch, Travis CI will run all of the unit tests and the CRAN checks on macOS and Linux (R on Windows isn't currently supported on Travis). That way, if you are reviewing a pull request, you'll be able to see if it plays well with the "canonical version" of the code (usually the master branch).