index.Rmd

---
title: "derfinder counting paper Supplementary Website"
author: "L Collado-Torres"
date: "`r doc_date()`"
output: 
  BiocStyle::html_document
---

```{r citationsSetup, echo=FALSE, message=FALSE, warning=FALSE}
## Track time spent on making the report
startTime <- Sys.time()

## Bib setup
library('knitcitations')

## Load knitcitations with a clean bibliography
cleanbib()
cite_options(hyperlink = 'to.doc', citation_format = 'text', style = 'html')
# Note links won't show for now due to the following issue
# https://github.com/cboettig/knitcitations/issues/63

bibs <- c("knitcitations" = citation("knitcitations"),
    "derfinder" = citation("derfinder")[1],
    "regionReport" = citation("regionReport")[1],
    "GenomicRanges" = citation("GenomicRanges"),
    "DESeq2" = citation("DESeq2"),
    "edgeR" = citation("edgeR")[5],
    "BiocStyle" = citation("BiocStyle"),
    'rmarkdown' = citation('rmarkdown'),
    'knitr' = citation('knitr')[3],
    'eff' = RefManageR::BibEntry('manual', key = 'eff', title = 'Efficiency analysis of Sun Grid Engine batch jobs', author = 'Alyssa Frazee', year = 2014, url = 'http://dx.doi.org/10.6084/m9.figshare.878000'),
    'zhou2011' = RefManageR::BibEntry('article', key = 'zhou2011', author = "Zhou, Zhifeng and Yuan, Qiaoping and Mash, Deborah C and Goldman, David", title = "Substance-specific and shared transcription and epigenetic changes in the human hippocampus chronically exposed to cocaine and alcohol", journal = "Proceedings of the National Academy of Sciences of the United States of America", year = 2011, volume = "108", number = "16", pages = "6626-6631"),
    'rail' = RefManageR::BibEntry('article', key = 'rail', author = 'Abhinav Nellore and Leonardo Collado-Torres and Andrew E. Jaffe and José Alquicira-Hernández and Jacob Pritt and James Morton and Jeffrey T. Leek  and Ben Langmead', journal = 'bioRxiv', year = '2015', title = 'Rail-RNA: {Scalable} analysis of {RNA}-seq splicing and coverage'),
    'stringtie' = RefManageR::BibEntry('article', key = 'stringtie', author = ' Mihaela Pertea and Geo M. Pertea and Corina M. Antonescu and Tsung-Cheng Chang and Joshua T. Mendell and Steven L. Salzberg', journal = 'Nature Biotechnology', year = '2015', title = 'StringTie enables improved reconstruction of a transcriptome from RNA-seq reads'),
    'hisat' = RefManageR::BibEntry('article', key = 'hisat', author = 'Daehwan Kim and Ben Langmead and Steven L Salzberg', journal = 'Nature Methods', year = '2015', title = 'HISAT: a fast spliced aligner with low memory requirements'),
    'ballgown' = RefManageR::BibEntry('article', key = 'ballgown', author = 'Alyssa C. Frazee and Geo Pertea and Andrew E. Jaffe and Ben Langmead and Steven L. Salzberg  and Jeffrey T. Leek', journal = 'Nature Biotechnology', year = '2015', title = 'Ballgown bridges the gap between transcriptome assembly and expression analysis'))
    
write.bibtex(bibs, file = 'index.bib')
bib <- read.bibtex('index.bib')

## Assign short names
names(bib) <- names(bibs)
```

This page describes the supplementary material for the `derfinder` counting paper. All the `bash`, `R` and `R Markdown` source files used to analyze the data for this project as well as generate the HTML reports are available in this website. However, it is easier to view them at [github.com/leekgroup/derCountSupp](https://github.com/leekgroup/derCountSupp).

# Hippocampus and time-course data sets

This section of the website describes the code and reports associated with the hippocampus and time-course data sets that are referred to in the paper.


## Code to reproduce analyses


There are 9 main `bash` scripts named _step1-*_ through _step9-*_. 

1. _fullCoverage_ loads the data from the raw files. See [step1-fullCoverage.sh](step1-fullCoverage.sh) and [step1-fullCoverage.R](step1-fullCoverage.R).
1. _makeModels_ creates the models used for the single base-level analysis.  See [step2-makeModels.sh](step2-makeModels.sh) and [step2-makeModels.R](step2-makeModels.R).
1. _analyzeChr_ runs the single base-level analysis by chromosome.  See [step3-analyzeChr.sh](step3-analyzeChr.sh) and [step3-analyzeChr.R](step3-analyzeChr.R).
1. _mergeResults_ merges the single base-level analysis results for all the chromosomes. See [step4-mergeResults.sh](step4-mergeResults.sh).
1. _derfinderReport_ generates a HTML report for the single base-level DERs. See [step5-derfinderReport.sh](step5-derfinderReport.sh).
1. _regionMatrix_ identifies the expressed regions for the expressed regions-level approach. See [step6-regionMatrix.sh](step6-regionMatrix.sh).
1. _regMatVsDERs_ creates a simple HTML report comparing the single-base and expressed-regions approaches. See [step7-regMatVsDERs.sh](step7-regMatVsDERs.sh) and [step7-regMatVsDERs.Rmd](step7-regMatVsDERs.Rmd).
1. _coverageToExon_ creates an exon count table using known annotation information. See [step8-coverageToExon.sh](step8-coverageToExon.sh) and [step8-coverageToExon.R](step8-coverageToExon.R).
1. _summaryInfo_ creates a HTML report with brief summary information for the given experiment. See [step9-summaryInfo.sh](step9-summaryInfo.sh), [step9-summaryInfo.R](step9-summaryInfo.R), and [step9-summaryInfo.Rmd](step9-summaryInfo.Rmd).

There are also 3 optional `bash` scripts used when BAM files are available.

1. _sortSam_ creates sorted by sequence name SAM files. See [optional1-sortSam.sh](optional1-sortSam.sh).
1. _HTSeq_ creates the exon count tables using `HTSeq`. See [optional2-HTSeq.sh](optional2-HTSeq.sh).
1. _summOv_ uses `GenomicRanges` to create the exon count tables. See [optional3-summOv.sh](optional3-summOv.sh) and [optional3-summOv.R](optional3-summOv.R).
1. _featureCounts_

A final `bash` script, [run-all.sh](run-all.sh), can be used to run the main 9 steps (or a subset of them).

All scripts show at the beginning the way they were used. Some of them generate intermediate small `bash` scripts, for example one script per chromosome for the _analyzeChr_ step. For some steps, there is a companion `R` or `R Markdown` code file when the code is more involved or an HTML file is generated in the particular step.


The [check-analysis-time.R](check-analysis-time.R) script was useful for checking the progress of the _step3-analyzeChr_ jobs and detect whenever a node in the cluster was presenting problems.


We expect that these scripts will be useful to `derfinder` users who want to automate the single base-level and/or expressed regions-level analyses for several data sets and/or have the jobs run automatically without having to check if each step has finished running.


Note that all `bash` scripts are tailored for the cluster we have access to which administer job queues with Sun Grid Engine (SGE).


## Single base-level

### Quick overview HTML report

This HTML report contains basic information on the `derfinder` `r citep(bib[["derfinder"]])` results from the _Hippo_ data set. The report answers basic questions such as:

* What is the number of filtered bases?
* What is the number of candidate regions?
* How many candidate regions are significant?

It also illustrates three clusters of candidate differentially expressed regions (DERs) from the single base-level analysis. You can view the report by following this link:

* [Hippo](hippo/summaryInfo/run3-v1.0.10/summaryInfo.html)

### CSV files and annotation comparison

This HTML report has the code for loading the R data files and generating the CSV files. The report also has Venn diagrams showing the number of candidate DERs from the single base-level analysis that overlap known exons, introns and intergenic regions using the UCSC hg19 annotation. It also includes a detailed description of the columns in the CSV files.

View the [venn](venn/venn.html) report or its `R Markdown` source file [venn.Rmd](venn/venn.Rmd). 


## Timing and memory information


This HTML report has code for reading and processing the time and memory information for each job extracted with [efficiency_analytics](https://github.com/alyssafrazee/efficiency_analytics) `r citep(bib[["eff"]])`. The report contains a detailed description of the analysis steps and tables summarizing the maximum memory and time for each analysis step if all the jobs for that particular step were running simultaneously. Finally, there is an interactive table with the timing results.

View the [timing](timing/timing.html) report or check the `R Markdown` file [timing.Rmd](timing/timing.Rmd).


# Hippo vs previous results

[compareVsPNAS](hippo/pnas/compareVsPNAS.html) is an HTML report comparing 29 regions that were previously found to be differentially expressed `r citep(bib[["zhou2011"]])` versus the `derfinder` single base-level results. It also has code for identified differentially expressed disjoint exons. The additional script [counts-gene.R](hippo/counts-gene/counts-gene.R) has the code for gene counting with `summarizeOverlaps()`. [compareVsPNAS-gene](hippo/pnas/compareVsPNAS-gene.html) compares the results between `DESeq2` and `edgeR`-robust against `derfinder` at the gene level with 40 total plots: 10 for each case of agreement/disagreement. 

View the [compareVsPNAS](hippo/pnas/compareVsPNAS.html) report or check the `R Markdown` file [compareVsPNAS.Rmd](hippo/pnas/compareVsPNAS.Rmd) run by the [runComparison.sh](hippo/pnas/runComparison.sh) script. Also view the [compareVsPNAS-gene](hippo/pnas/compareVsPNAS-gene.html) report and its linked `R Markdown` file [compareVsPNAS-gene.Rmd](hippo/pnas/compareVsPNAS-gene.Rmd).


# Additional analyses

The following `R` source files have the code for reproducing additional analyses described in the paper

* [feature_counts.R](additional-analyses/feature_counts.R) Feature counts analysis of Hippo and Snyder data sets.
* [createGFF.R](gff/createGFF.R) and [runGFF.sh](gff/runGFF.sh) create the GFF file used the script [optional2-HTSeq.sh](optional2-HTSeq.sh).

This scripts also include other exploratory code.

# Reproducibility

Date this page was generated.

```{r reproducibility1, echo=FALSE}
## Date the report was generated
Sys.time()
```

Wallclock time spent generating the report.

```{r "reproducibility2", echo=FALSE}
## Processing time in seconds
totalTime <- diff(c(startTime, Sys.time()))
round(totalTime, digits=3)
```

`R` session information.

```{r "reproducibility3", echo=FALSE}
## Session info
options(width=120)
devtools::session_info()
```

You can view the source `R Markdown` file for this page at [index.Rmd](index.Rmd).

# Bibliography

This report was generated using `BiocStyle` `r citep(bib[['BiocStyle']])`
with `knitr` `r citep(bib[['knitr']])` and `rmarkdown` `r citep(bib[['rmarkdown']])` running behind the scenes.  

Citations were made with  `knitcitations` `r citep(bib[['knitcitations']])`. Citation file: [index.bib](index.bib).

```{r vignetteBiblio, results = 'asis', echo = FALSE, warning = FALSE}
## Print bibliography
bibliography()
```