hw06_puttingAllTogether.rmd

Go back to [STAT545A home](current.html)

Homework #6 The End: Putting it all together 
========================================================

> Follow the [existing homework submission instructions](hw00_instructions.html) __as much as possible__, e.g. use systematic, informative filenames. By 12pm Monday October 21, be prepared to share a link via a Google doc. I will announce sharing on the Google Group as usual.

### Big picture

  * Write at least two R scripts to enact an analysis
  * Output of the first script must be (an) input of the second (and so on ...)
  * Write numerical data to file; file must be readable by a human and by someone who doesn't use R
  * Write figure(s) to file
  * Provide evidence of the following:
    - Initial state of directory, containing data and R scripts
    - How you run your 2 (or more) scripts *non-interactively*
    - End state of the directory, containing newly written numerical and graphical output
  * Provide the files themselves and a description of what to do with them
  
### Templates

  * Bare bones example using only R: [hw06_scaffolds/01_justR](https://github.com/jennybc/STAT545A/tree/master/hw06_scaffolds/01_justR)
  * Example upgraded to include using Make to run the pipeline: [hw06_scaffolds/02_rAndMake](https://github.com/jennybc/STAT545A/tree/master/hw06_scaffolds/02_rAndMake)
  * Example compiling an R script and an R Markdown file to HTML via the makefile: [hw06_scaffolds/03_knitWithoutRStudio](https://github.com/jennybc/STAT545A/tree/master/hw06_scaffolds/03_knitWithoutRStudio)

### Please just tell me what to do

If you don't feel like dreaming up your own thing, here's a Gapminder blueprint that is a minimal but respectable way to complete the assignment. You are welcome to remix R code already written by someone, student or JB, in this class, but credit/link appropriately, e.g. in comments. 

JB has provided an template, using a different dataset, [here](https://github.com/jennybc/STAT545A/tree/master/hw06_scaffolds/01_justR), that should help make this concrete.

  * Write a suitably named R script:  
  
    - Import the data.  
    - Save a couple descriptive plots to file with highly informative names.  
    - Reorder the continents based on life expectancy. You decide the details.  
    - Sort the actual data in a deliberate fashion. You decide the details, but this should *at least* implement your new continent ordering.  
    - Write the Gapminder data to file(s), for immediate and future reuse.  
    
  * Write a second suitably named R script:
  
    - Import the data created in the first script.  
    - Make sure your new continent order is still in force. You decide the details.  
    - Fit a linear regression of life expectancy on year within each country. Write the estimated intercepts, slopes, and residual error variance (or sd) to file.  
    - Find the 3 or 4 "worst" and "best" countries for each continent. You decide the details.  
    - Write the linear regression info for just these countries to file.  
    - Create a single-page figure for each continent, including data only for the 6-8 "extreme" countries, and write to file. One file per continent, with an informative name. The figure should give scatterplots of life expectancy vs. year, panelling/facetting on country, fitted line overlaid.  
    
  * Identify and test a method of running your pipeline non-interactively.
  
    - You could write a master R script that simply `source()`s the two scripts, one after the other. Tip: you will probably want a second "clean up / reset" script that deletes all the output your scripts leave behind, so you can easily test and refine your strategy, i.e. without repeatedly  deletng stuff "by hand". You can run the master script or the cleaning script from a shell with `R CMD BATCH` or `Rscript`.

    > Dear Windows people: Recent versions of Windws have PowerShell, whose syntax resembles the Unix-type shells the Mac/Linux folks will have automatically. Try using that. See below for links to documentation on PowerShell and for help getting other software.
    
  * Provide a link to a page that explains how your pipeline works and links to the remaining files. Jenny and Song should be able to go to this landing page and re-run your analysis quickly and easily.
  
    - Minimum: I think you can accomplish this with RPubs and Gist, with a carefully constructed README-type file that links to the other files.
    - Alternatively, you could put this all in a directory on the web somewhere and submit a link to file that serves as landing page. In that case, it is a very nice touch to offer the files in a browsable form AND as an easily downloaded archive.

### I want to aim higher!

Follow the basic Gapminder blueprint above, but find a different data aggregation task, different panelling/facetting emphasis, focus on different variables, etc.

Use non-Gapminder data.

  * This means you'll need to spend more time on data cleaning and sanity checking. You will probably have an entire script (or more!) devoted to data prep. Examples:
    - Are there wonky factors? Consider bringing in as character, addressing their deficiencies, then converting to factor.
    - Are there variables you're just willing to drop at this point? Do it!
    - Are there dates and times that need special handling? Do it!
    - Are there annoying observations that require very special handling or crap up your figures (e.g. Oceania)? Drop them!
    
Include some dynamic report generation in your pipeline. That is, create HTML from one or more plain R or R markdown files.

  * Example of how to emulate RStudio's "Compile Notebook" button from a shell:
  
    - `Rscript -e "knitr::stitch_rmd('myAwesomeScript.R')"`
    
  * To emulate "Knit HTML", do something like the above but with `knitr::knit2html()`.
  
  * See [the Makefile in this example](https://github.com/jennybc/STAT545A/blob/master/hw06_scaffolds/03_knitWithoutRStudio/Makefile) to see these commands in action

Experiment with running R code saved in a script from within R Markdown. Here's some official documentation on [code externalization](http://yihui.name/knitr/demo/externalization/).

Embed pre-existing figures in and R Markdown docuement, i.e. an R script creates the figures, then the report incorporates them. General advice on writing figures to file is [here](topic12_writeFigureToFile.html) and `ggplot2` has a purpose-built function `ggsave()` you should try. See an example of this in [an R Markdown file in one of the examples](https://github.com/jennybc/STAT545A/blob/master/hw06_scaffolds/03_knitWithoutRStudio/03_doStuff.Rmd).

Import pre-existing data in an R Markdown document, then format nicely as a table.

Use Pandoc and/or LaTeX to explore new territory in document compilation. You could use Pandoc as an alternative to `knitr` for Markdown to HTML conversion; you'd still use `knitr` for conversion of R Markdown to Markdown. You would use LaTeX to get PDF output from R Markdown.

Start using Git! Create a new Git repository for all the files for this assignment. See below for help.

Start using Github! Share your repo with us on the web! See below for help.

Use `Make` to run your pipeline. See below for help. Also demonstrated in the example [hw06_scaffolds/02_rAndMake](https://github.com/jennybc/STAT545A/tree/master/hw06_scaffolds/02_rAndMake) and in the example [hw06_scaffolds/03_knitWithoutRStudio](https://github.com/jennybc/STAT545A/tree/master/hw06_scaffolds/03_knitWithoutRStudio)

### How to install and use recommended tools

  * [Setup instructions for Software Carpentry R Bootcamp, Canberra, October 9 - 10 2013](http://swcarpentry.github.io/2013-10-09-canberra/lessons/setup.html) <-- has good info on GitBash, a combined Bash shell + git install for Windows
  * [Additional software installation instructions](https://github.com/swcarpentry/2013-10-09-canberra/blob/master/additional_software.md) from above SWC Bootcamp
  * [Microsoft info on PowerShell](http://technet.microsoft.com/en-us/library/hh848793)
  * [An introduction to Git/Github](http://kbroman.github.io/github_tutorial/) by Karl Broman, aimed at stats / data science types
  * JB __highly recommends__ the free [SourceTree](http://www.sourcetreeapp.com) Git client. Available for Windows or Mac.
  * Ram, K 2013. Git can facilitate greater reproducibility and increased transparency in science. Source Code for Biology and Medicine 2013 8:7. Go to the [associated github repo](https://github.com/karthikram/smb_git) to get the PDF (link at bottom of README) and to see a great example of how someone managed the process of writing a manuscript with git(hub).
  * Blog post [Version control for scientific research](http://blogs.biomedcentral.com/bmcblog/2013/02/28/version-control-for-scientific-research/) by Karthik Ram and Titus Brown on the BioMed Central blog February 28
  * Blog post [Getting Started With a GitHub Repository](http://chronicle.com/blogs/profhacker/getting-started-with-a-github-repository) from ProfHacker on chronicle.com looks helpful
  * Need a chuckle? [xkcd on Git commit messages](http://xkcd.com/1296/)
  * [An introduction to `Make`](http://kbroman.github.io/minimal_make/) by Karl Broman, aimed at stats / data science types  
  * Blog post [Using Make for reproducible scientific analyses](http://www.bendmorris.com/2013/09/using-make-for-reproducible-scientific.html) by Ben Morris
  * [Slides on `Make`](http://software-carpentry.org/v4/make/index.html) from Software Carpentry
  * Mike Bostock on [Why Use Make](http://bost.ocks.org/mike/make/)
  * Zachary M. Jones on [GNU Make for Reproducible Data Analysis](http://zmjones.com/make.html)
  * Ben Bolker RPub [Literate programming, version control, reproducible research, collaboration, and all that](http://rpubs.com/bbolker/3153)
  * A set-up free way to play with Git: [Code School - Try Git](http://try.github.io/levels/1/challenges/1)
  
<div class="footer">
This work is licensed under the  <a href="http://creativecommons.org/licenses/by-nc/3.0/">CC BY-NC 3.0 Creative Commons License</a>.
</div>