Merge pull request #119 from greenelab/envest/update_readme
Envest/update readme
envest authored May 2, 2022
2 parents 4852bec + 64c9817 commit 21d5aed
Showing 4 changed files with 135 additions and 61 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -11,6 +11,7 @@ models
results
plots/main/*.pdf
plots/supplementary/*.pdf
plots/visualize_expression/*.pdf
.Rproj.user
RNAseq_titration_results.Rproj
._RNAseq_titration_results.Rproj
195 changes: 134 additions & 61 deletions README.md
@@ -1,113 +1,186 @@
# Cross-platform normalization enables machine learning model training on microarray and RNA-seq data simultaneously

The full output of a [version](https://github.com/greenelab/RNAseq_titration_results/releases/tag/v1.1) of this analysis is available at Figshare under the DOI: [10.6084/m9.figshare.5035997.v2](https://doi.org/10.6084/m9.figshare.5035997.v2)
<!-- START doctoc generated TOC please keep comment here to allow auto update -->
<!-- DON'T EDIT THIS SECTION, INSTEAD RE-RUN doctoc TO UPDATE -->
**Table of Contents** *generated with [DocToc](https://github.com/thlorenz/doctoc)*

- [Summary](#summary)
- [Requirements](#requirements)
- [Obtaining and running the Docker container](#obtaining-and-running-the-docker-container)
- [Download data from The Cancer Genome Atlas (TCGA)](#download-data-from-the-cancer-genome-atlas-tcga)
- [Recreate manuscript results](#recreate-manuscript-results)
- [Methods](#methods)
- [Machine Learning Pipeline](#machine-learning-pipeline)
- [Differential Expression Pipeline](#differential-expression-pipeline)
- [Running individual experiments](#running-individual-experiments)
- [Machine learning](#machine-learning)
- [Differential expression](#differential-expression)
- [Other scripts](#other-scripts)
- [Manuscript versions](#manuscript-versions)
- [Funding](#funding)

<!-- END doctoc generated TOC please keep comment here to allow auto update -->

## Summary

We performed a series of supervised and unsupervised machine learning
evaluations, as well as differential expression analyses, to assess which
evaluations, as well as differential expression and pathway analyses, to assess which
normalization methods are best suited for combining data from microarray and
RNA-seq platforms.

We evaluated five normalization approaches for all methods:
We evaluated six normalization approaches for all methods:

1. log-transformation (LOG)
2. [non-paranormal transformation](https://arxiv.org/abs/0903.0649) (NPN)
3. [quantile normalization](http://bmbolstad.com/misc/normalize/bolstad_norm_paper.pdf) (QN)
4. [Training Distribution Matching](https://peerj.com/articles/1621/) (TDM)
5. standardizing scores (z-scoring; Z).
4. quantile normalization followed by z-scoring (QN-Z)
5. [Training Distribution Matching](https://peerj.com/articles/1621/) (TDM)
6. z-scoring (Z)
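
To make the QN-Z entry concrete, here is a minimal Python sketch of quantile normalization followed by z-scoring. It is not the repository's R implementation: the function names, the position-based tie handling, the zero-variance guard, and the use of the population standard deviation are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def quantile_normalize(columns):
    """Give every sample (column) the same distribution.

    `columns` is a list of samples, each a list of expression values
    for the same genes in the same order.
    """
    n_genes = len(columns[0])
    sorted_cols = [sorted(col) for col in columns]
    # Reference distribution: mean across samples at each rank.
    reference = [mean(col[r] for col in sorted_cols) for r in range(n_genes)]

    normalized = []
    for col in columns:
        # Order of gene indices by value; ties are broken by position
        # (a simplification, assumed here for brevity).
        order = sorted(range(n_genes), key=lambda i: col[i])
        out = [0.0] * n_genes
        for rank, gene_index in enumerate(order):
            out[gene_index] = reference[rank]
        normalized.append(out)
    return normalized


def z_score(values):
    """Center to mean 0 and scale to unit (population) standard deviation."""
    m, s = mean(values), pstdev(values)
    if s == 0:
        # Constant input: return zeros rather than divide by zero.
        return [0.0 for _ in values]
    return [(v - m) / s for v in values]


# Two toy samples measured on three genes.
samples = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
qn = quantile_normalize(samples)
print(qn)  # [[5.5, 1.5, 3.5], [3.5, 1.5, 5.5]]

# QN-Z: z-score each gene across the quantile-normalized samples.
qn_z = [z_score(list(gene)) for gene in zip(*qn)]
```

After quantile normalization both toy samples contain the same set of values; only the assignment of values to genes differs between them.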

A [version](https://github.com/greenelab/RNAseq_titration_results/releases/tag/v1.0) of this project is detailed in our pre-print
[Cross-Platform Normalization Enables Machine Learning Model Training On Microarray And RNA-Seq Data Simultaneously](https://doi.org/10.1101/118349).

_We are actively making improvements to this codebase; see [#12](https://github.com/greenelab/RNAseq_titration_results/issues/12)._

## Breast Cancer Data

[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.58862.svg)](https://doi.org/10.5281/zenodo.58862)
## Requirements

We recommend using the docker image `envest/rnaseq_titration_results:R-4.1.2` to handle package and dependency installation.
See `docker/R-4.1.2/Dockerfile` for more information.

Our analysis ([v2.0](https://github.com/greenelab/RNAseq_titration_results/releases/tag/v2.0)) was run using 7 cores on an AWS instance with 16 cores, 128 GB memory, and an allocated 1 TB of space.

The Cancer Genome Atlas BRCA data used for these analyses
### Obtaining and running the Docker container

Pull the docker image using:

```
docker pull envest/rnaseq_titration_results:R-4.1.2
```

Then run the command to start up a container, replacing `[PASSWORD]` with your own password:

```
docker run --mount type=bind,target=/home/rstudio,source=$PWD -e PASSWORD=[PASSWORD] -p 8787:8787 envest/rnaseq_titration_results:R-4.1.2
```

Navigate to <http://localhost:8787/> and login to the RStudio server with the username `rstudio` and the password you set above.


## Download data from The Cancer Genome Atlas (TCGA)

TCGA data from 520 breast cancer (BRCA) patients used for these analyses
is [available at zenodo](https://zenodo.org/record/58862).

Data from 150 glioblastoma (GBM) patients is available from the [Genomic Data Commons PanCan Atlas](https://gdc.cancer.gov/about-data/publications/pancanatlas).

To download data, run the data download script in the top directory:

```
bash download_TCGA_data.sh
```

## Recreate manuscript results

After data has been downloaded, running

```
# To download data, run in top directory:
sh brca_data_download.sh
bash run_all_analyses_and_plots.sh
```

## Analysis
with [v2.0](https://github.com/greenelab/RNAseq_titration_results/releases/tag/v2.0) of this repository will reproduce the results presented in our manuscript.
We recommend running all analyses within the project Docker container.

## Methods

### Machine Learning Pipeline

Here's a schematic overview of our machine learning experiments:

![](https://github.com/greenelab/RNAseq_titration_results/blob/master/diagrams/RNA-seq_titration_ML_overview.png)
![](diagrams/RNA-seq_titration_ML_overview.png)

**Overview of supervised and unsupervised machine learning experiments.**

1. 520 TCGA Breast Cancer samples run on both microarray and RNA-seq were split
into a training (2/3) and holdout set (1/3).
2. RNA-seq’d samples were "titrated" into the training set, 10% at a time (0-100%)
resulting in eleven training sets for each normalization method.
3. _Machine learning applications._ Three supervised multi-class (BRCA PAM50 subtype)
classifiers—LASSO, linear SVM, and Random Forest—were trained on each training set
and tested on the microarray and RNA-seq holdout sets. The holdout sets were projected
onto and back out of the training set space using two unsupervised techniques, Independent
and Principal Components Analysis, to obtain reconstructed holdout sets. The
classifiers used in step 4A above were used to predict on the reconstructed holdout
sets.
1. Matched samples run on both microarray and RNA-seq were split into a training (2/3) and holdout set (1/3).
2. RNA-seq samples were "titrated" into the training set, 10% at a time (0-100%), replacing their matched array samples, resulting in eleven training sets for each normalization method.
3. Machine learning applications:

```
# To run the machine learning pipeline, run in top directory:
sh run_machine_learning_experiments.sh
- _Supervised learning_:
We trained three classifiers (LASSO, linear SVM, and Random Forest) on each training set and tested them on the microarray and RNA-seq holdout sets.
The models were trained to predict tumor subtype (both cancer types have 5 subtypes) and the binary mutation status of _TP53_ and _PIK3CA_.

# To run one repeat of the subtype classifier pipeline, use:
Rscript run_experiments.R
```
- _Unsupervised learning_:
We projected holdout sets onto and back out of the training set space using Principal Components Analysis to obtain reconstructed holdout sets.
We then used the trained subtype classifiers to predict on the reconstructed holdout sets.
[PLIER](https://github.com/wgmao/PLIER) (Pathway-Level Information ExtractoR) identified coordinated sets of genes in each cancer type.
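
Steps 1 and 2 above (the train/holdout split and the titration) can be sketched in Python as follows. This is an illustration under stated assumptions, not the repository's R code: the function names are hypothetical, the split is a plain random split with no stratification by subtype, and each higher titration level is assumed to contain the RNA-seq samples of the lower levels.

```python
import random


def split_train_holdout(sample_ids, train_fraction=2 / 3, seed=0):
    """Randomly split matched samples into training and holdout sets."""
    rng = random.Random(seed)
    shuffled = list(sample_ids)
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]


def titrate(train_ids, seed=0):
    """Build eleven training sets with 0%, 10%, ..., 100% RNA-seq samples.

    Each sample appears exactly once per set: either as its array
    measurement or as its matched RNA-seq measurement.
    """
    rng = random.Random(seed)
    order = list(train_ids)
    rng.shuffle(order)
    training_sets = {}
    for pct in range(0, 101, 10):
        n_seq = round(len(order) * pct / 100)
        seq_ids = set(order[:n_seq])  # nested selection (assumption)
        training_sets[pct] = {
            sample: ("RNA-seq" if sample in seq_ids else "array")
            for sample in train_ids
        }
    return training_sets


# Hypothetical sample identifiers.
samples = [f"sample_{i:03d}" for i in range(30)]
train, holdout = split_train_holdout(samples)
sets = titrate(train)
print(len(train), len(holdout), len(sets))  # 20 10 11
```

Keying the result by percentage makes it straightforward to loop over the eleven training sets once per normalization method.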

### Differential Expression Pipeline

Here's a schematic overview of our main differential expression experiment:

![](https://github.com/greenelab/RNAseq_titration_results/blob/master/diagrams/RNA-seq_titration_diff_expression_overview.png?raw=true)
![](diagrams/RNA-seq_titration_diff_expression_overview.png)

**Overview of differential expression experiment.**

1. All matched TCGA breast cancer samples (n = 520) were considered when building the platform-specific
“silver standards.” These standards are the genes that were differentially
expressed at a specified False Discovery Rate (FDR) using data sets comprised
entirely of one platform and processed in a standard way: log2-transformed
microarray data and “untransformed” RSEM count data (preprocessed using the
`limma::voom` function).
2. RNA-seq’d samples were ‘titrated’ into the data set,
10% at a time (0-100%) resulting in eleven experimental sets for each
normalization method.
3. Differentially expressed genes (DEGs) were identified using
the `limma` package. We compared the Her2 and LumA subtypes as well as Basal
v. all other samples.
4. Lists of experimental DEGs were compared to standard gene
sets using Jaccard similarity.
1. All matched samples were considered when building the platform-specific “silver standards.”
These standards are the genes that were differentially expressed at a specified False Discovery Rate (FDR) using data sets comprised entirely of one platform and processed in a standard way: log2-transformed
microarray data and “untransformed” RNA-seq data.
2. RNA-seq samples were "titrated" into the data set, 10% at a time (0-100%), replacing their matched array samples, resulting in eleven experimental sets for each normalization method.
3. Differentially expressed genes (DEGs) were identified using the `limma` package.
For BRCA, we compared the Her2 and LumA subtypes as well as Basal v. all other subtypes.
For GBM, we compared the Classical and Mesenchymal subtypes as well as Proneural v. all other subtypes.
4. Lists of experimental DEGs were compared to standard gene sets using Jaccard similarity and Spearman rank correlation.

In the "small n" experiment, between 3 and 50 samples were selected from each subtype for DEG comparison.
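
The Jaccard comparison in step 4 is the size of the intersection of two gene lists divided by the size of their union. A minimal Python sketch follows; the function name, the gene identifiers, and the convention for two empty lists are assumptions for illustration, and the values are not results from this analysis.

```python
def jaccard(genes_a, genes_b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two gene lists."""
    a, b = set(genes_a), set(genes_b)
    union = a | b
    if not union:
        # Two empty lists are treated as identical (a convention assumed here).
        return 1.0
    return len(a & b) / len(union)


# Placeholder gene identifiers, not real DEG calls.
silver_standard = {"GENE_A", "GENE_B", "GENE_C", "GENE_D"}
experimental = {"GENE_A", "GENE_B", "GENE_D", "GENE_E"}
print(jaccard(silver_standard, experimental))  # 0.6
```

A value of 1 means the experimental DEG list matches the silver standard exactly, and 0 means the two lists share no genes.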


## Running individual experiments

#### Machine learning

To run the machine learning pipeline, run in top directory:

```
# Note: This requires the data to be processed to include matched samples only,
# and split into training and test sets (0-expression_data_overlap_and_split.R)
bash run_machine_learning_experiments.sh [cancer type] [prediction task] [n cores]
```

where

# To run the differential expression pipeline, run in top directory:
sh run_differential_expression_experiments.sh
- `[cancer type]` is `BRCA` or `GBM`
- `[prediction task]` is `subtype`, `TP53`, or `PIK3CA`
- `[n cores]` is the number of cores you want to run in parallel

#### Differential expression

⚠️ _This requires the data to be processed to include matched samples only, and split into training and test sets via `0-expression_data_overlap_and_split.R` in the machine learning pipeline._

To run the differential expression pipeline, run in top directory:

```
bash run_differential_expression_experiments.sh [cancer type] [subtype vs others] [subtype vs subtype] [subtype vs subtype small] [n cores]
```

## Requirements
where

- `[cancer type]` is `BRCA` or `GBM`
- `[subtype vs others]` is the subtype to be compared against all other subtypes
- `[subtype vs subtype]` are the two subtypes to be compared (comma-separated, e.g. `Her2,LumA`)
- `[subtype vs subtype small]` are the two subtypes to be compared at small sample sizes (comma-separated, e.g. `Her2,LumA`)
- `[n cores]` is the number of cores you want to run in parallel
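
As a hedged illustration of how the five arguments fit together, the Python helper below assembles the command line for the BRCA comparisons described under Methods. The helper is hypothetical and not part of the repository; the subtype label `Basal` and the core count of 7 are assumed values based on the text above, so check the script for the exact labels it accepts.

```python
import shlex

VALID_CANCER_TYPES = {"BRCA", "GBM"}


def build_de_command(cancer_type, vs_others, vs_subtype, vs_subtype_small, n_cores):
    """Assemble the argument list for the differential expression script."""
    if cancer_type not in VALID_CANCER_TYPES:
        raise ValueError(f"cancer type must be one of {sorted(VALID_CANCER_TYPES)}")
    for pair in (vs_subtype, vs_subtype_small):
        if len(pair) != 2:
            raise ValueError("subtype comparisons need exactly two subtypes")
    return [
        "bash",
        "run_differential_expression_experiments.sh",
        cancer_type,
        vs_others,
        ",".join(vs_subtype),        # comma-separated, e.g. Her2,LumA
        ",".join(vs_subtype_small),  # comma-separated, e.g. Her2,LumA
        str(n_cores),
    ]


cmd = build_de_command("BRCA", "Basal", ("Her2", "LumA"), ("Her2", "LumA"), 7)
print(shlex.join(cmd))
# bash run_differential_expression_experiments.sh BRCA Basal Her2,LumA Her2,LumA 7
```

The returned list can be passed to `subprocess.run` from the top directory of the repository once the data have been downloaded and split.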

#### Other scripts

This analysis was performed in R. It requires R & Bioconductor packages
detailed in `check_installs.R` to be installed.
To search for the number of publicly available microarray and RNA-seq samples from [GEO](https://www.ncbi.nlm.nih.gov/geo/) and [ArrayExpress](https://www.ebi.ac.uk/arrayexpress/), run

One github package (`TDM`) is required. To install, run:
```
python3 search_geo_arrayexpress.py
```
and check the output in `results/array_rnaseq_ratio`.

library(devtools)
devtools::install_github("greenelab/TDM")
## Manuscript versions

**This analysis is [in the process](https://github.com/greenelab/RNAseq_titration_results/issues/18) of being moved to a Docker image.**
| Version | Relevant links |
| :------ | :------------- |
| [v2.0](https://github.com/greenelab/RNAseq_titration_results/releases/tag/v2.0) | [Figshare+ data](https://doi.org/10.25452/figshare.plus.19629864.v1), [Data for plots](https://doi.org/10.6084/m9.figshare.19686453) |
| [v1.1](https://github.com/greenelab/RNAseq_titration_results/releases/tag/v1.1) | [Figshare full results](https://doi.org/10.6084/m9.figshare.5035997.v2) |
| [v1.0](https://github.com/greenelab/RNAseq_titration_results/releases/tag/v1.0) | [Pre-print](https://doi.org/10.1101/118349) |

## Funding

This work was supported the Gordon and Betty Moore Foundation [GBMF 4552] and
the National Institutes of Health [T32-AR007442, U01-TR001263].
This work was supported by the Gordon and Betty Moore Foundation [GBMF 4552], Alex's Lemonade Stand Foundation [GR-000002471], and the National Institutes of Health [T32-AR007442, U01-TR001263, R01-CA237170, K12GM081259].
Binary file modified diagrams/RNA-seq_titration_ML_overview.png
Binary file modified diagrams/RNA-seq_titration_diff_expression_overview.png