This repository contains the scripts generated for analyzing data for the 500HT RNA-seq Project.
If one wanted to use the scripts herein to anaylze their own data using GEMMA, they would need to use the following pipeline:
Expression matrix should be in bimbam format with each column representing a gene (separated by " ", though this may not be strictly necessary) and each row representing an individual.
Matrix of PCs in same order as expression matrix - each row is an individual and each column is different PC.
- See cis/pca.R
Square matrix of relatedness can either be generated by GEMMA itself, or by extraction from the additive covariance matrix from the Abney lab.
- See cis/square_matrix_maker.py (change locations of Abney covariance file (line 2), .fam file listing findivs (line 10) and ouput square matrix(line 13))
- Recommended: wait to generate plink files below and then use that .fam file to generate square matrix in appropriate order
Genotypes of choice need to to be in plink .bed format. This can be extracted from the master plink files with this command:
plink --bfile hutt.imputed.rename --keep [ID File] --make-bed --out hutt.imputed.newsubset
The [ID File] has a column of "HUTTERITES" (repeated) and column of individual IDs.
This genotype data can then be converted into a set of tabix-indexed files (separate files for each chromosome) with cis/plink_bed2tabix.py.
Usage: python plink_bed2tabix.py
Necessary python modules:
- subprocess
- time
- glob
- sys
- DarrenTools (defines 'ifier' and 'matrix_reader' functions)
- numpy (for raw2txt.py)
Hardcoded lines:
- Line 20: directory containing genotype files. Script expects a 'ByChr/' directory to be present within the genotype directory.
- Line 21: name of output files (currently set to 'hutt.all.imputed'
- Line 25: name of file (currently set to 'hutt.imputed.rename')
- Line 36: location of raw2txt.py script
Note:
Running python plink_bed2tabix.py
will submit qsub jobs automatically. Therefore you must be on a node that is able to submit qsub jobs.
I could not find a way to affect the order of individuals in the plink files, so I reordered everything else to match the plink files. I would recommend that others generate the genotype files first, then reorder the expression matrix according to this file, then generate the PC matrix and covariance matrix.
- See cis/expr_reorg.R
###1. cis/column_grabber.py
Usage: python column_grabber.py
Hardcoded lines:
- Line 5: location of plink format .bim file for genotypes
- Line 19: names of genes from expression matrix in same order as expression matrix
- Line 31: for each chromosome (1-22), names of genes by chromosome-specific files (again order should match chrom-specific expression matrices)
- Line 39: name of master index table to write out to
- Line 40: name of master index table if genes are broken out by chromosome
- Line 43: file listing gene/snp overlaps (column 1 = gene name, column 2 = snpID). I used bedtools and cut to generate this.
Note: Generating the gene/SNP overlap file can be accomplished in a few short steps:
Step 1: A .bed file of transcription start sites needs to defined (chr start stop geneID). Defining the TSS is actually not a trivial task and requires careful consideration of the study design, so I will leave that to you to decide.
Step 2: A .bed file of SNP coordinates can be created based on the .bim file with the following command:
awk -v OFS='\t' '{ print "chr"$1, $4-1, $4, $2}' [input .bim file] > [output .bed file]
Step 3: After you pick a window to define as the cis test region, you can generate the overlap file with bedtools:
/mnt/lustre/home/cusanovich/Programs/BEDTools/bin/windowBed -w [cis region size in bp] -a [TSS .bed file] -b [SNP .bed file] | cut -f4,8 > [gene/SNP overlap file]
Result: Master index table that lists Gene Name, SNP ID, row from expression matrix of gene (0-based), row of SNP from .bim (0-based), chromosome ID (e.g. 'chr1')
###2. cis/multi_pc_eqtl_driver.sh
Usage: bash multi_pc_eqtl_driver.sh
Hardcoded lines:
- Line 2: # of PCs to regress out
e.g. 'for i in $(seq 61 1 100)' with seq [start] [step] [stop]
- Line 4: location of eqtl_driver.py script
Note: Must be called from a node that can submit qsub jobs
Result: Calls eqtl_driver.py separately for each number of PCs we wish to regress out.
###3. cis/eqtl_driver.py (called directly from #2)
Usage: python eqtl_driver.py [No. of PCs] (typically called from within multi_pc_eqtl_driver.sh)
Necessary python modules:
- subprocess
- glob
- time
- sys
Hardcoded lines:
- Line 19: location of alt_gemma_noplink_eqtl_mapper.py (actual mapper script)
- Line 19: stdout and stderr file locations (e.g. '~/dump/')
- Lines 22-25: Files to monitor for a finished chromosomal eqtl mapping (e.g. '/mnt/lustre/home/cusanovich/500HT/ByChr/*.PC' + str(pcs) + '.bonferroni.done')
- Line 28: Location of output files (e.g. '/mnt/lustre/home/cusanovich/500HT/ByChr/*.PC' + str(pcs) + '.imputed.1Mb.bonferroni.gccor.newcovcor.regressPCs.gemma.eqtls.txt' and '/mnt/lustre/home/cusanovich/500HT/eQTLs/master.PC' + str(pcs) + '.imputed.1Mb.bonferroni.gccor.newcovcor.regressPCs.gemma.chosen.txt')
Results: For each chromosome, submits the eQTL mapper job, monitors the outdir for completion of mapping, then concatenates all the chrom files into a master file for the eQTL mapping with the current number of PCs regressed out.
###4. cis/alt_gemma_noplink_eqtl_mapper.py (called directly from #3)
Usage: python alt_gemma_noplink_eqtl_mapper.py [Chromosome ID (e.g. 'chr1')] [No. PCs to regress out]
Necessary python modules:
- os
- sys
- pysam
- subprocess
- numpy
- random
- gzip
- DarrenTools (defines 'ifier' and 'matrix_reader' functions)
Hardcoded lines:
- Line 20: True/False whether to look for GC content corrected expression matrix
- Line 21: True/False whether to look for covariate corrected expression matrix
- Line 22: True/False whether to use Bonferroni correction for number of SNPs tested for each gene (alternative is permutations - Slow!!!)
- Line 23: True/False whether to regress out PCs before GEMMA or include them as a covariate
- Line 28: Directory where genotype files are
- Line 30: Home directory (just used to shorten some of the code in the file)
- Lines 31-50: Naming conventions for keeping track of True/False toggles above in output file name
- Lines 63 and 65: part of permutation function requires some hardcoded locations
- Lines 78 and 81: location of master index table from #1
- Line 97: location of .bed format file listing SNP locations
- Line 102: hmdir + location of square additive covariance matrix for relatedness, and where a current copy can be made
- Line 106: hmdir + location of .bimbam format gene expression matrix
- Line 107: hmdir + location of PC values for regressing out PCs
- Line 110: hmdir + location of PC values if supplying them to GEMMA
- Line 151: For each gene, where to write out current expression values for GEMMA
- Lines 153-155: hmdir + locations to write out expression data if supplying PCs to GEMMA as covariate
- Line 165: location of Tabix-indexed genotypes
- Line 171: location where to write out current genotypes
- Line 176: hmdir + location of GEMMA program, current genotypes, current expression values, current copy of relatedness, non-default arguments to GEMMA (which tests to run, MAF)
- Line 179: same as above except for when passing PCs to GEMMA as covariate
- Line 181: location where GEMMA results are written out ('output' directory within directory specified in Line 28)
- Line 191, 193 and 194: test used (I am using the LRT results, Column 12 of GEMMA output)
- Line 213: location to combine all results for current chromosome to (reports all P-values for all SNP/Gene pairs)
- Line 219: location to combine bonferroni-corrected lead SNP/gene pairs to
- Line 225: location to write out '.done' file when finished (this must match up with where #3 is monitoring)
Results: For each gene on the current chromosome, will run GEMMA against all SNPs specified and collect results into two files:
- '.gemma.eqtls.txt' file gives p-values for all SNP/gene pairs
- '.gemma.chosen.txt' file gives original and bonferroni-corrected p-values for lead SNP/gene pairs, where lead SNP means the SNP with the smallest p-value observed for each gene
- cis/eqtl_resubmitter.py
Sometimes the pipeline can break down. This file scans for chromosome/PC configurations that do not appear to have finished and re-submits them. - cis/eQTL_evaluator.R
Make a plot of the number of eQTLs identified at a certain FDR for all the PC regression files available.