Merge pull request #119 from greenelab/envest/update_readme
Envest/update readme
Showing 4 changed files with 135 additions and 61 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,113 +1,186 @@ | ||
# Cross-platform normalization enables machine learning model training on microarray and RNA-seq data simultaneously | ||
|
||
The full output of a [version](https://github.com/greenelab/RNAseq_titration_results/releases/tag/v1.1) of this analysis is available at Figshare under the DOI: [10.6084/m9.figshare.5035997.v2](https://doi.org/10.6084/m9.figshare.5035997.v2) | ||
<!-- START doctoc generated TOC please keep comment here to allow auto update --> | ||
<!-- DON'T EDIT THIS SECTION, INSTEAD RE-RUN doctoc TO UPDATE --> | ||
**Table of Contents** *generated with [DocToc](https://github.com/thlorenz/doctoc)* | ||
|
||
- [Summary](#summary) | ||
- [Requirements](#requirements) | ||
- [Obtaining and running the Docker container](#obtaining-and-running-the-docker-container) | ||
- [Download data from The Cancer Genome Atlas (TCGA)](#download-data-from-the-cancer-genome-atlas-tcga) | ||
- [Recreate manuscript results](#recreate-manuscript-results) | ||
- [Methods](#methods) | ||
- [Machine Learning Pipeline](#machine-learning-pipeline) | ||
- [Differential Expression Pipeline](#differential-expression-pipeline) | ||
- [Running individual experiments](#running-individual-experiments) | ||
- [Machine learning](#machine-learning) | ||
- [Differential expression](#differential-expression) | ||
- [Other scripts](#other-scripts) | ||
- [Manuscript versions](#manuscript-versions) | ||
- [Funding](#funding) | ||
|
||
<!-- END doctoc generated TOC please keep comment here to allow auto update --> | ||
|
||
## Summary | ||
|
||
We performed a series of supervised and unsupervised machine learning | ||
evaluations, as well as differential expression analyses, to assess which | ||
evaluations, as well as differential expression and pathway analyses, to assess which | ||
normalization methods are best suited for combining data from microarray and | ||
RNA-seq platforms. | ||
|
||
We evaluated five normalization approaches for all methods: | ||
We evaluated six normalization approaches for all methods: | ||
|
||
1. log-transformation (LOG) | ||
2. [non-paranormal transformation](https://arxiv.org/abs/0903.0649) (NPN) | ||
3. [quantile normalization](http://bmbolstad.com/misc/normalize/bolstad_norm_paper.pdf) (QN) | ||
4. [Training Distribution Matching](https://peerj.com/articles/1621/) (TDM) | ||
5. standardizing scores (z-scoring; Z). | ||
4. quantile normalization followed by z-scoring (QN-Z) | ||
5. [Training Distribution Matching](https://peerj.com/articles/1621/) (TDM) | ||
6. z-scoring (Z) | ||
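
As a rough illustration of how two of the listed approaches compose, here is a minimal Python sketch of quantile normalization followed by z-scoring (QN-Z). It is not the repository's R implementation. The function names, the choice of the mean of sorted samples as the reference distribution, the use of the sample standard deviation, and the arbitrary handling of ties are all assumptions made for this example.

```python
from statistics import mean, stdev


def z_score(values):
    """Standardize one gene's values across samples (mean 0, sample SD 1)."""
    mu = mean(values)
    sd = stdev(values)
    if sd == 0:
        # A constant gene carries no information; map it to zeros (assumption).
        return [0.0 for _ in values]
    return [(v - mu) / sd for v in values]


def quantile_normalize(samples):
    """Map every sample (a list of gene values) onto one shared distribution.

    Assumption: the reference distribution is the mean of the sorted samples,
    and ties are broken by gene order rather than averaged.
    """
    n_genes = len(samples[0])
    sorted_samples = [sorted(s) for s in samples]
    reference = [mean(s[i] for s in sorted_samples) for i in range(n_genes)]

    normalized = []
    for s in samples:
        # Gene indices ordered from lowest to highest value in this sample.
        order = sorted(range(n_genes), key=lambda i: s[i])
        out = [0.0] * n_genes
        for rank, gene_idx in enumerate(order):
            out[gene_idx] = reference[rank]
        normalized.append(out)
    return normalized


def qn_then_z(samples):
    """QN-Z: quantile normalize samples, then z-score each gene across samples."""
    qn = quantile_normalize(samples)
    genes = list(zip(*qn))  # transpose to one tuple per gene
    z_genes = [z_score(list(g)) for g in genes]
    return [list(s) for s in zip(*z_genes)]  # transpose back to samples
```

Calling `qn_then_z([[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]])` treats each inner list as one sample measured over three genes.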
|
||
A [version](https://github.com/greenelab/RNAseq_titration_results/releases/tag/v1.0) of this project is detailed in our pre-print | ||
[Cross-Platform Normalization Enables Machine Learning Model Training On Microarray And RNA-Seq Data Simultaneously](https://doi.org/10.1101/118349). | ||
|
||
_We are actively making improvements to this codebase; see [#12](https://github.com/greenelab/RNAseq_titration_results/issues/12)._ | ||
|
||
## Breast Cancer Data | ||
|
||
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.58862.svg)](https://doi.org/10.5281/zenodo.58862) | ||
## Requirements | ||
|
||
We recommend using the docker image `envest/rnaseq_titration_results:R-4.1.2` to handle package and dependency installation. | ||
See `docker/R-4.1.2/Dockerfile` for more information. | ||
|
||
Our analysis ([v2.0](https://github.com/greenelab/RNAseq_titration_results/releases/tag/v2.0)) was run using 7 cores on an AWS instance with 16 cores, 128 GB memory, and an allocated 1 TB of space. | ||
|
||
The Cancer Genome Atlas BRCA data used for these analyses | ||
### Obtaining and running the Docker container | ||
|
||
Pull the docker image using: | ||
|
||
``` | ||
docker pull envest/rnaseq_titration_results:R-4.1.2 | ||
``` | ||
|
||
Then run the command to start up a container, replacing `[PASSWORD]` with your own password: | ||
|
||
``` | ||
docker run --mount type=bind,target=/home/rstudio,source=$PWD -e PASSWORD=[PASSWORD] -p 8787:8787 envest/rnaseq_titration_results:R-4.1.2 | ||
``` | ||
|
||
Navigate to <http://localhost:8787/> and log in to the RStudio server with the username `rstudio` and the password you set above. | ||
|
||
|
||
## Download data from The Cancer Genome Atlas (TCGA) | ||
|
||
TCGA data from 520 breast cancer (BRCA) patients used for these analyses | ||
is [available at zenodo](https://zenodo.org/record/58862). | ||
|
||
Data from 150 glioblastoma (GBM) patients is available from the [Genomic Data Commons PanCan Atlas](https://gdc.cancer.gov/about-data/publications/pancanatlas). | ||
|
||
To download data, run the data download script in the top directory: | ||
|
||
``` | ||
bash download_TCGA_data.sh | ||
``` | ||
|
||
## Recreate manuscript results | ||
|
||
After data has been downloaded, running | ||
|
||
``` | ||
# To download data, run in top directory: | ||
sh brca_data_download.sh | ||
bash run_all_analyses_and_plots.sh | ||
``` | ||
|
||
## Analysis | ||
with [v2.0](https://github.com/greenelab/RNAseq_titration_results/releases/tag/v2.0) of this repository will reproduce the results presented in our manuscript. | ||
We recommend running all analyses within the project Docker container. | ||
|
||
## Methods | ||
|
||
### Machine Learning Pipeline | ||
|
||
Here's a schematic overview of our machine learning experiments: | ||
|
||
![](https://github.com/greenelab/RNAseq_titration_results/blob/master/diagrams/RNA-seq_titration_ML_overview.png) | ||
![](diagrams/RNA-seq_titration_ML_overview.png) | ||
|
||
**Overview of supervised and unsupervised machine learning experiments.** | ||
|
||
1. 520 TCGA Breast Cancer samples run on both microarray and RNA-seq were split | ||
into a training (2/3) and holdout set (1/3). | ||
2. RNA-seq’d samples were "titrated" into the training set, 10% at a time (0-100%) | ||
resulting in eleven training sets for each normalization method. | ||
3. _Machine learning applications._ Three supervised multi-class (BRCA PAM50 subtype) | ||
classifiers—LASSO, linear SVM, and Random Forest—were trained on each training set | ||
and tested on the microarray and RNA-seq holdout sets. The holdout sets were projected | ||
onto and back out of the training set space using two unsupervised techniques, Independent | ||
and Principal Components Analysis, to obtain reconstructed holdout sets. The | ||
classifiers used in step 4A above were used to predict on the reconstructed holdout | ||
sets. | ||
1. Matched samples run on both microarray and RNA-seq were split into a training (2/3) and holdout set (1/3). | ||
2. RNA-seq samples were "titrated" into the training set, 10% at a time (0-100%), replacing their matched array samples, resulting in eleven training sets for each normalization method. | ||
3. Machine learning applications: | ||
|
||
``` | ||
# To run the machine learning pipeline, run in top directory: | ||
sh run_machine_learning_experiments.sh | ||
- _Supervised learning_: | ||
We trained three classifiers (LASSO, linear SVM, and Random Forest) on each training set and tested them on the microarray and RNA-seq holdout sets. | ||
The models were trained to predict tumor subtype (both cancer types have 5 subtypes) and the binary mutation status of _TP53_ and _PIK3CA_. | ||
|
||
# To run one repeat of the subtype classifier pipeline, use: | ||
Rscript run_experiments.R | ||
``` | ||
- _Unsupervised learning_: | ||
We projected holdout sets onto and back out of the training set space using Principal Components Analysis to obtain reconstructed holdout sets. | ||
We then used the trained subtype classifiers to predict on the reconstructed holdout sets. | ||
[PLIER](https://github.com/wgmao/PLIER) (Pathway-Level Information ExtractoR) identified coordinated sets of genes in each cancer type. | ||
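
The split and titration steps described above can be sketched in Python as follows. This is a simplified illustration, not the pipeline's R code: the function names, the fixed random seed, and the unstratified random sampling are assumptions made for this example.

```python
import random


def split_train_holdout(sample_ids, seed=0):
    """Randomly split matched samples into training (2/3) and holdout (1/3)."""
    rng = random.Random(seed)
    shuffled = list(sample_ids)
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * 2 / 3)
    return shuffled[:n_train], shuffled[n_train:]


def titrate(train_ids, seed=0):
    """Build eleven training-set plans, from 0% to 100% RNA-seq in 10% steps.

    Each plan maps a sample ID to the platform its data is taken from, so an
    RNA-seq sample always replaces its matched array sample.
    """
    rng = random.Random(seed)
    plans = {}
    for pct in range(0, 101, 10):
        n_seq = round(len(train_ids) * pct / 100)
        seq_ids = set(rng.sample(train_ids, n_seq))
        plans[pct] = {
            sid: ("rnaseq" if sid in seq_ids else "array") for sid in train_ids
        }
    return plans


# Hypothetical sample IDs, used only to exercise the functions.
ids = [f"S{i}" for i in range(30)]
train, holdout = split_train_holdout(ids)
plans = titrate(train)
```

Every plan covers the same training samples, so the training set size stays constant while the platform mix changes.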
|
||
### Differential Expression Pipeline | ||
|
||
Here's a schematic overview of our main differential expression experiment: | ||
|
||
![](https://github.com/greenelab/RNAseq_titration_results/blob/master/diagrams/RNA-seq_titration_diff_expression_overview.png?raw=true) | ||
![](diagrams/RNA-seq_titration_diff_expression_overview.png) | ||
|
||
**Overview of differential expression experiment.** | ||
|
||
1. All matched TCGA breast cancer samples (n = 520) were considered when building the platform-specific | ||
“silver standards.” These standards are the genes that were differentially | ||
expressed at a specified False Discovery Rate (FDR) using data sets comprised | ||
entirely of one platform and processed in a standard way: log2-transformed | ||
microarray data and “untransformed” RSEM count data (preprocessed using the | ||
`limma::voom` function). | ||
2. RNA-seq’d samples were ‘titrated’ into the data set, | ||
10% at a time (0-100%) resulting in eleven experimental sets for each n | ||
ormalization method. | ||
3. Differentially expressed genes (DEGs) were identified using | ||
the `limma` package. We compared the Her2 and LumA subtypes as well as Basal | ||
v. all other samples. | ||
4. Lists of experimental DEGs were compared to standard gene | ||
sets using Jaccard similarity. | ||
1. All matched samples were considered when building the platform-specific “silver standards.” | ||
These standards are the genes that were differentially expressed at a specified False Discovery Rate (FDR) using data sets comprised entirely of one platform and processed in a standard way: log2-transformed | ||
microarray data and “untransformed” RNA-seq data. | ||
2. RNA-seq samples were "titrated" into the data set, 10% at a time (0-100%), replacing their matched array samples, resulting in eleven experimental sets for each normalization method. | ||
3. Differentially expressed genes (DEGs) were identified using the `limma` package. | ||
For BRCA, we compared the Her2 and LumA subtypes as well as Basal v. all other subtypes. | ||
For GBM, we compared the Classical and Mesenchymal subtypes as well as Proneural v. all other subtypes. | ||
4. Lists of experimental DEGs were compared to standard gene sets using Jaccard similarity and Spearman rank correlation. | ||
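
For readers unfamiliar with the comparison in step 4, Jaccard similarity is the size of the intersection of two gene sets divided by the size of their union. The Python sketch below illustrates only that definition. The function name, the convention for two empty sets, and the example gene lists are assumptions for this example, not results or code from the analysis.

```python
def jaccard(experimental_degs, standard_degs):
    """Jaccard similarity between two collections of gene identifiers."""
    a, b = set(experimental_degs), set(standard_degs)
    if not a and not b:
        # Convention chosen for this sketch: two empty sets are identical.
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical gene lists, for illustration only.
experimental = ["ESR1", "ERBB2", "GATA3"]
silver_standard = ["ERBB2", "GATA3", "FOXA1"]
similarity = jaccard(experimental, silver_standard)  # 2 shared / 4 total = 0.5
```

A value of 1 means the experimental DEG list matches the silver standard exactly, and 0 means the two share no genes.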
|
||
In the "small n" experiment, between 3 and 50 samples were selected from each subtype for DEG comparison. | ||
|
||
|
||
## Running individual experiments | ||
|
||
#### Machine learning | ||
|
||
To run the machine learning pipeline, run in top directory: | ||
|
||
``` | ||
# Note: This requires the data to be processed to include matched samples only, | ||
# and split into training and test sets (0-expression_data_overlap_and_split.R) | ||
bash run_machine_learning_experiments.sh [cancer type] [prediction task] [n cores] | ||
``` | ||
|
||
where | ||
|
||
# To run the differential expression pipeline, run in top directory: | ||
sh run_differential_expression_experiments.sh | ||
- `[cancer type]` is `BRCA` or `GBM` | ||
- `[prediction task]` is `subtype`, `TP53`, or `PIK3CA` | ||
- `[n cores]` is the number of cores you want to run in parallel | ||
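
To queue every combination of cancer type and prediction task, the argument lists can be generated programmatically. The Python sketch below only builds the commands, using the script name and argument values listed above. The helper name `ml_command` and the choice of 7 cores are assumptions for this example.

```python
from itertools import product

CANCER_TYPES = ("BRCA", "GBM")
PREDICTION_TASKS = ("subtype", "TP53", "PIK3CA")


def ml_command(cancer_type, prediction_task, n_cores):
    """Return the argument list for one machine learning pipeline run."""
    if cancer_type not in CANCER_TYPES:
        raise ValueError(f"cancer type must be one of {CANCER_TYPES}")
    if prediction_task not in PREDICTION_TASKS:
        raise ValueError(f"prediction task must be one of {PREDICTION_TASKS}")
    if n_cores < 1:
        raise ValueError("n cores must be at least 1")
    return [
        "bash",
        "run_machine_learning_experiments.sh",
        cancer_type,
        prediction_task,
        str(n_cores),
    ]


# One command per (cancer type, prediction task) pair: six in total.
commands = [ml_command(c, t, 7) for c, t in product(CANCER_TYPES, PREDICTION_TASKS)]
```

Each entry can then be passed to `subprocess.run(cmd, check=True)` from the top directory of the repository.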
|
||
#### Differential expression | ||
|
||
⚠️ _This requires the data to be processed to include matched samples only, and split into training and test sets via `0-expression_data_overlap_and_split.R` in the machine learning pipeline._ | ||
|
||
To run the differential expression pipeline, run in top directory: | ||
|
||
``` | ||
bash run_differential_expression_experiments.sh [cancer type] [subtype vs others] [subtype vs subtype] [subtype vs subtype small] [n cores] | ||
``` | ||
|
||
## Requirements | ||
where | ||
|
||
- `[cancer type]` is `BRCA` or `GBM` | ||
- `[subtype vs others]` is the subtype to be compared against all other subtypes | ||
- `[subtype vs subtype]` are the two subtypes to be compared (comma-separated, e.g. `Her2,LumA`) | ||
- `[subtype vs subtype small]` are the two subtypes to be compared at small sample sizes (comma-separated, e.g. `Her2,LumA`) | ||
- `[n cores]` is the number of cores you want to run in parallel | ||
|
||
#### Other scripts | ||
|
||
This analysis was performed in R. It requires R & Bioconductor packages | ||
detailed in `check_installs.R` to be installed. | ||
To search for the number of publicly available microarray and RNA-seq samples from [GEO](https://www.ncbi.nlm.nih.gov/geo/) and [ArrayExpress](https://www.ebi.ac.uk/arrayexpress/), run | ||
|
||
One github package (`TDM`) is required. To install, run: | ||
``` | ||
python3 search_geo_arrayexpress.py | ||
``` | ||
and check the output in `results/array_rnaseq_ratio`. | ||
|
||
library(devtools) | ||
devtools::install_github("greenelab/TDM") | ||
## Manuscript versions | ||
|
||
**This analysis is [in the process](https://github.com/greenelab/RNAseq_titration_results/issues/18) of being moved to a Docker image.** | ||
| Version | Relevant links | | ||
| :------ | :------------- | | ||
| [v2.0](https://github.com/greenelab/RNAseq_titration_results/releases/tag/v2.0) | [Figshare+ data](https://doi.org/10.25452/figshare.plus.19629864.v1), [Data for plots](https://doi.org/10.6084/m9.figshare.19686453) | | ||
| [v1.1](https://github.com/greenelab/RNAseq_titration_results/releases/tag/v1.1) | [Figshare full results](https://doi.org/10.6084/m9.figshare.5035997.v2) | | ||
| [v1.0](https://github.com/greenelab/RNAseq_titration_results/releases/tag/v1.0) | [Pre-print](https://doi.org/10.1101/118349) | | ||
|
||
## Funding | ||
|
||
This work was supported the Gordon and Betty Moore Foundation [GBMF 4552] and | ||
the National Institutes of Health [T32-AR007442, U01-TR001263]. | ||
This work was supported by the Gordon and Betty Moore Foundation [GBMF 4552], Alex's Lemonade Stand Foundation [GR-000002471], and the National Institutes of Health [T32-AR007442, U01-TR001263, R01-CA237170, K12GM081259]. |