Cross-platform normalization enables machine learning model training on microarray and RNA-seq data simultaneously
Published article: Foltz, S. M., Greene, C. S. & Taroni, J. N. Cross-platform normalization enables machine learning model training on microarray and RNA-seq data simultaneously. Commun Biol 6, 222 (2023). https://doi.org/10.1038/s42003-023-04588-6
Table of Contents
- Summary
- Requirements
- Download data from The Cancer Genome Atlas (TCGA)
- Recreate manuscript results
- Methods
- Running individual experiments
- Manuscript versions
- Funding
We performed a series of supervised and unsupervised machine learning evaluations, as well as differential expression and pathway analyses, to assess which normalization methods are best suited for combining data from microarray and RNA-seq platforms.
We evaluated seven normalization approaches for all methods:
- log-transformation (LOG)
- non-paranormal transformation (NPN)
- quantile normalization (QN)
- quantile normalization via CrossNorm
- quantile normalization followed by z-scoring (QN-Z)
- Training Distribution Matching (TDM)
- z-scoring (Z)
We also explored the use of Seurat to normalize array and RNA-seq data. Due to low sample numbers at the edges of our titration protocol, many experimental conditions could not be integrated.
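As a rough illustration of the two simplest approaches listed above (LOG and Z), here is a minimal sketch in plain Python. It is not the repository's implementation: the function names, the log base, the pseudocount of 1, and the choice to z-score one gene across samples are all assumptions made for the example, and the counts are made up.

```python
import math
import statistics


def log_transform(values, pseudocount=1.0):
    """Return log2(x + pseudocount) for each expression value.

    The base-2 log and the pseudocount are illustrative assumptions.
    """
    return [math.log2(v + pseudocount) for v in values]


def z_score(values):
    """Center values to mean 0 and scale them to unit sample standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1 denominator)
    if sd == 0:
        # A constant gene carries no signal; return zeros instead of dividing by 0.
        return [0.0 for _ in values]
    return [(v - mean) / sd for v in values]


# One hypothetical gene measured across four samples (made-up counts).
gene_counts = [0, 3, 15, 63]
logged = log_transform(gene_counts)
scaled = z_score(logged)
```

Here `logged` holds the log-transformed values and `scaled` holds the same gene after z-scoring, so values from different platforms end up on a comparable, unitless scale.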
We recommend using the docker image envest/rnaseq_titration_results:R-4.1.2 to handle package and dependency installation.
See docker/R-4.1.2/Dockerfile for more information.
Our analysis (v2.3) was run using 7 cores on an AWS instance with 16 cores, 128 GB memory, and an allocated 1 TB of space.
Pull the docker image using:
docker pull envest/rnaseq_titration_results:R-4.1.2
Then run the following command to start up a container, replacing [PASSWORD] with your own password:
docker run --mount type=bind,target=/home/rstudio,source=$PWD -e PASSWORD=[PASSWORD] -p 8787:8787 envest/rnaseq_titration_results:R-4.1.2
Navigate to http://localhost:8787/ and log in to the RStudio server with the username rstudio and the password you set above.
TCGA data from 520 breast cancer (BRCA) patients used for these analyses is available at Zenodo.
Data from 150 glioblastoma (GBM) patients is available from the Genomic Data Commons PanCan Atlas.
To download data, run the data download script in the top directory:
bash download_TCGA_data.sh
After data has been downloaded, running
bash run_all_analyses_and_plots.sh [cancer type]
where [cancer type] is both, BRCA, or GBM, with v2.3 of this repository will reproduce the results presented in our manuscript. We recommend running all analyses within the project Docker container.
Here's a schematic overview of our supervised and unsupervised machine learning experiments:
- Matched samples run on both microarray and RNA-seq were split into a training (2/3) and holdout set (1/3).
- RNA-seq samples were "titrated" into the training set, 10% at a time (0-100%), replacing their matched array samples, resulting in eleven training sets for each normalization method.
- Machine learning applications:
  - Supervised learning: We trained three classifiers (LASSO, linear SVM, and Random Forest) on each training set and tested them on the microarray and RNA-seq holdout sets. The models were trained to predict tumor subtype (both cancer types have 5 subtypes) and the binary mutation status of TP53 and PIK3CA.
  - Unsupervised learning: We projected holdout sets onto and back out of the training set space using Principal Components Analysis to obtain reconstructed holdout sets. We then used the trained subtype classifiers to predict on the reconstructed holdout sets. PLIER (Pathway-Level Information ExtractoR) identified coordinated sets of genes in each cancer type.
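The splitting and titration steps described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the pipeline's actual code: the function names and sample IDs are hypothetical, the RNA-seq samples are drawn independently at each titration level (the text does not say whether the draws are cumulative), and no stratification by subtype or mutation status is attempted.

```python
import random


def split_train_holdout(sample_ids, train_fraction=2 / 3, seed=0):
    """Randomly split matched sample IDs into training and holdout lists."""
    rng = random.Random(seed)
    shuffled = list(sample_ids)
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]


def titrate(train_ids, seed=0):
    """Build one training set per RNA-seq percentage: 0, 10, ..., 100.

    Returns a dict mapping percent RNA-seq to (array_ids, rnaseq_ids).
    Every training sample appears exactly once per set, measured either
    on array or on RNA-seq.
    """
    rng = random.Random(seed)
    sets = {}
    for pct in range(0, 101, 10):
        n_seq = round(len(train_ids) * pct / 100)
        # Assumption: samples are redrawn independently at each level.
        seq_ids = set(rng.sample(train_ids, n_seq))
        array_ids = [s for s in train_ids if s not in seq_ids]
        sets[pct] = (array_ids, sorted(seq_ids))
    return sets


# Hypothetical matched samples, each run on both platforms.
samples = [f"sample_{i:03d}" for i in range(30)]
train, holdout = split_train_holdout(samples)
training_sets = titrate(train)
```

Each entry of `training_sets` keeps the training set size fixed while changing only the platform mix, which is what allows performance to be compared across titration levels.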
To run the machine learning pipeline, run the following in the top directory:
bash run_machine_learning_experiments.sh [cancer type] [prediction task] [n cores]
where
- [cancer type] is BRCA or GBM
- [prediction task] is subtype, TP53, or PIK3CA
- [n cores] is the number of cores you want to run in parallel
To search for the number of publicly available microarray and RNA-seq samples from GEO and ArrayExpress, run
python3 search_geo_arrayexpress.py
and check the output in results/array_rnaseq_ratio.
To see which PLIER pathways are identified more frequently with full sample size data than with half sample size data, run
Rscript -e "rmarkdown::render('8-PLIER_pathways_analysis.Rmd', clean = TRUE)"
and examine the results in 8-PLIER_pathways_analysis.nb.html.
This work was supported by the Gordon and Betty Moore Foundation [GBMF 4552], Alex's Lemonade Stand Foundation [GR-000002471], and the National Institutes of Health [T32-AR007442, U01-TR001263, R01-CA237170, K12GM081259].
Can I normalize array data to match RNA-seq data?
We generally do not advise this study design. We expect array data to have less precision at higher expression levels due to saturation, while counts-based RNA-seq data does not have that problem. We recommend reshaping the data expected to have more dynamic range (RNA-seq) to fit the narrower and less precise (array) distribution. See also TDM FAQs.
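To make the recommended direction concrete, the sketch below reshapes RNA-seq values to follow an array-derived distribution using a generic reference-based quantile mapping. It is only an illustration of the idea: it is not the repository's QN or TDM implementation, the function name is hypothetical, and the values are made up.

```python
def quantile_normalize_to_target(values, target):
    """Map each value to the target value at the same rank-based quantile.

    `values` is the data to reshape (here RNA-seq) and `target` supplies
    the reference distribution (here array). Ties in `values` are broken
    by their original order, which is a simplification.
    """
    target_sorted = sorted(target)
    n, m = len(values), len(target_sorted)
    # Indices of `values` ordered from smallest to largest value.
    order = sorted(range(n), key=lambda i: values[i])
    result = [0.0] * n
    for rank, idx in enumerate(order):
        quantile = rank / (n - 1) if n > 1 else 0.0
        # Fractional position of this quantile in the sorted target.
        pos = quantile * (m - 1)
        lo = int(pos)
        hi = min(lo + 1, m - 1)
        frac = pos - lo
        # Linear interpolation between neighboring target values.
        result[idx] = target_sorted[lo] * (1 - frac) + target_sorted[hi] * frac
    return result


# Made-up values: a narrow array-like range and a wide RNA-seq-like range.
array_values = [4.0, 6.0, 8.0, 10.0, 12.0]
rnaseq_values = [0.0, 250.0, 5.0, 90000.0, 40.0]
reshaped = quantile_normalize_to_target(rnaseq_values, array_values)
```

The rank order of the RNA-seq values is preserved, but their wide range is compressed into the narrower array range, rather than stretching the less precise array values to cover a range they cannot resolve.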