- update scripts for model choice
- add 3- and 4- pops pipelines
- remove unnecessary scripts
- edit Rscript for parameter estimations & RF
whole pipeline to perform coalescent simulations and abc inferences
This pipeline was used to reconstruct the demographic history of Atlantic salmon salmo salar but workds for any SNPs/sequence data to perform demographic modelling.
I also provide additional script to reproduce major analysis from the paper about "reconstructing the history of Atlantic salmon salmo salar using ABC" now publish in Evolution and available here
- IM: divergence with continuous gene flow
- AM: divergence with migration at the initial stage of the speciation process
- SI: divergence in strict isolation
- SC: divergence in strict isolation followed by gene flow
these models inclued linked selection and barrier to gene flow and are represented here:
- compile the coalescent simulator and stats calculator
- prepare your input file
- choose the model you want to run and the appropriate prior
- run the coalescent sims
- reshape the data
- run the abc analysis for :
-
(a) model selection
-
(b) robustness computation
-
(c) parameters estimation and goodness of fit
- (eventually run the pipeline using only barrier loci)
- (eventually draw neutral enveloppe to perform outlier detection test - I will add this later )
depending on your architecture there is more or less efficient way to run the code. icc compilation will make code runs faster. command lines for compilation are provided in:
compile_msnsam.sh
located in src/msnsam_code
folder
compile_mscalc.sh
located in src/mscalc_code
folder
- Alternatively for The coalescent simulator (msnsam) you can type:
gcc -O2 -o msnsam msnsam.c rand1.c streec.c -lm
gcc -o samplestats tajd.c sample_stats.c -lm
gcc -o mean_std stats.c -lm
if you have access to an icc architecture you can speeds up the code :
icc -O3 -xHost -o ../../bin/msnsam msnsam.c rand1.c streec.c -lm
- Then for The summary statistics computer (mscalc):
gcc *.c -o mscalc -lm
if you have access to an icc
architecture you can speed up the code :
icc -O3 -xHost *.c -o ../../bin/mscalc -lm
The data I used for abc analysis of the Salmo salar are in a companion github repository.
Download the github repository store the data in 01-salmon_data
and run:
./00-scripts/utility_scripts/prepare_abc.sh 01-salmon_data/salmon.data.ped 01-salmon_data/salmon.data.map
all the data will be created in different folders.
Now these data are ready to perform coalescent simulations. You can clone/paste the folders
00-scripts/ bin/ results/ and src/
in all the pairwise folder where you want to run simulations.
The original data are available on dryad:
10.5061/dryad.gm367 and doi:10.5061/dryad.sb601
If you want you can dowload them, merge them and filter them in plink
using the options:
--geno 0.05
--mind 0.05
--noweb
this will result in a dataset of ~5000 SNPs and 2035 individuals as provided in the sister github repository
if you are lazy preparing the data i also provide a ready to use example in the folder 00-example_input_abc
for the sister github repository
Then you'll have to choose the models and prior associate to each models.
Four classical models are implemented:
- Strict Isolation (SI)
- Isolation w. migration (IM)
- Secondary Contact (SC)
- Ancient migration (AM)
All models were used for the Salmon data. In addition the following models were implemented for some other project I may publish one day:
- panmixia
- bottleneck
- equilibrium
The four classic models have different alternatives. See Roux et al. 2013 MBE, Roux et al. 2014 JEB, Roux et al. 2016 Plos Biol for more info These are:
- NhomoMhomo (=same migration rate M and Ne among loci)
- NhomoMhetero (=same Ne among loci but different M among loci)
- NheteroMhomo (=same M but different Ne among loci )
- NheteroMhetero (= different M and different N among loci)
For the SC models there is also the possibility to test for:
- Periodic Secondary Contacts (PSC) when populations undergone two phases of splits and mixture during the divergence process
- bottleneck post divergence in the daughter populations
in ms
(and msnsam
) all Ne are scaled by 4*Nref*µ (*L) where:
Nref = size of a reference population
µ = mutation rate
L = length of the locus
so that θ = 4*N1*µ*L/4*Nref*µ*L
Similarly Τ = Tsplit/4*Nref*µ*L
and M = 4*Nref*m
For more details please look at the msdoc.pdf
manual before running your inferences.
Once you are okay with ms
coalescent parameters, you may want to modify the priors:
All priors on N1, N2, Nancestral, Migration rates, Split times, etc can be edited in the 00-scripts/models/model.*.sh
files.
Coalescent models are found in 00-scripts/models/model.{1..18}.sh
files.
Each number corresponds to a different model as follows:
- model.1.sh = SI heterogenous Ne
- model.2.sh = SI homogenous Ne
- model.3.sh = AM heterogenous Ne and heterogenous M
- model.4.sh = AM homogenous Ne and heterogenous M
- model.5.sh = AM heterogenous Ne and homogenous M
- model.6.sh = AM homogenous Ne and homogenous M
- model.7.sh = SC heterogenous Ne and heterogenous M
- model.8.sh = SC homogenous Ne and heterogenous M
- model.9.sh = SC heterogenous Ne and homogenous M
- model.10.sh = IM heterogenous Ne and heterogenous M
- model.11.sh = IM homogenous Ne and heterogenous M
- model.12.sh = IM heterogenous Ne and homogenous M
- model.13.sh = SC homogenous Ne and homogenous M
- model.14.sh = IM homogenous Ne and homogenous M
- model.15.sh = SCb heterogenous Ne and heterogenous M
- model.16.sh = PSC heterogenous Ne and heterogenous M
- model.17.sh = PSCb heterogenous Ne and heterogenous M
- model.18.sh = PCSbottleneck heterogenous Ne and heterogenous M
- model.19.sh = EQ heterogenous Ne homogenous M
- model.20.sh = EQ homogenous Ne heterogenous M
- model.21.sh = EQ heterogenous Ne heterogenous M
- model.22.sh = EQ homogenous Ne homogenous M
- model.00.sh = PAN homogenous Ne
- model.0.sh = PAN heterogenous Ne
I've run them using moab scheduler
on a cluster where I could launch several job in parallel.
This is the purpose of the 00-scripts/models/model_job_array*.sh
scripts that you'll have to customize according to your own cluster.
run the script ./00-scripts/00.reshape.sh
This will merge all parallel simulation into one prior file and one summary_statistics file with 1milions row (1 row/simulations)
(a) model selection
(b) robustness computation
(c) parameters estimation and goodness of fit
dependencies:
- R software
abc
package : more info here
run ./00-scripts/01.model.selection.sh
you'll need some abc background (see references list below)
##### robustness computation
cd compute_robustness.abc
run ./01-scripts/01_robustess_job.sh args
where arg is the name of the model either si.1 si.3 am.1 am.2 am.3 am.4 im.1 im.2 im.3 im.4 sc.1 sc.2 sc.3 sc.4
this is very time consumming....
./00-scripts/03.paramestim.im.all.sh
./00-scripts/03.paramestim.sc.all.sh
./00-scripts/03.paramestim.si.all.sh
./00-scripts/03.paramestim.am.all.sh
./00-scripts/06.goodness_fit.sh #for sc only!!!
./00-scripts/runall_abc_job.sh #for moab only
To do
treemix
available here and relevant reference here and there.
gsnap
available here
samtools
available here
The softwares must be into your path.
- download the Salmo salar reference genome here
store the data in 02-aditional_analisys/00-genome_alignment/01-input
- merge all chromosomes into a single fasta file (I used both the masked and unmasked version, it did not changed much of the results)
run a simple command like:
for i in 00-genome_alignment/01-input/*gz
do
gunzip $i ;
done
for i in 00-genome_alignment/01-input/*chrssa*fa
do
cat $i >> 00-genome_alignment/01-input/genome_fa ;
done
run gsnap
-
build the genome
./00-additional_analysis/00-scripts/01.gsnap_build.sh
-
align the sequences
./02-additional_analysis/00-scripts/02.gsnap_align.sh
-
then remove hardclip and softclip and order the ped files according the markers positions in the genome
(I'll document that soon)
I provide the reference fasta sequence in the sister repository
run Treemix and f3-tests
as always, the data (ped files) to reproduce th example are located in the sister repository in 02-treemix-data
.
You can download them, uncompress and store them in 02-additional_analysis/02-treemix/
#if you have already used treemix then you'll surely already have your own script to run the software.
#in this case simply prepare the data as follows
datafolder=02-additional_analyis/02-treemix
plink --file "$datafolder"/salmon.ordered.V2 --noweb --missing --freq --double-id --allow-extra-chr --chr-set 29 --within cluster.dat --out "$datafolder"/plink
gzip "$datafolder"/plink.frq.strat
02-additional_analysis/00-scripts/treemix/plink2treemix.py "$datafolder"/plink.frq.strat.gz "$datafolder"/treemix.frq.gz
#then you'll know what to do.
#Alternatively to prepare the input from the ped files and run the f3-test use the following command:
./02-additional_analysis/00-scripts/treemix/prepare_treemix.sh #will prepare the treemix data and perform f3tests
#Then to run treemix use a script like the one here
./02-additional_analysis/00-scripts/treemix/run_treemix.sh
#Note that this script is very simplified version, I used a more elaborate script to run in parrallel mode.
You can download the data in the sister repository in 03-spacemix-data
and store them in 02-additional_analysis/spacemix/01-data
.
cd 02-additional_analysis/03-spacemix
#creating input file:
./02-additional_analysis/03-spacemix/00-scripts/00.ped_to_matrice.R path/to/input/input.ped #with input ped the name of the ped_file (with the path/)
./02-additional_analysis/03-spacemix/00-scripts/01.make.allele.freq.cov.2.R path/to/input/matrix #with matrix the matix name and the path
#run spacemix (Here I very simply follow guideline from Bradburg's github)
./02-additional_analysis/03-spacemix/00-scripts/02.run.spacemix.R sample.covariance.matrix sample.size x_y.coordinates
#this takes some times and needs to be run on a cluster
./02-additional_analysis/03-spacemix/00-scripts/03.chek.spacemix.model.R arg1 arg2 arg3
#then plot the results
#This is very straightforward if you follow examples from spacemix manual
Review about abc:
- Beaumont MA, Zhang W, Balding DJ. Approximate Bayesian Computation in Population Genetics. Genetics. 2002;162: 2025–2035.
- Csillery, K., M.G.B. Blum, O.E. Gaggiotti and O. Francois (2010) Approximate Bayesian Computation (ABC) in practice. Trends in Ecology and Evolution, 25, 410-418.
Must Reads about Genome-Wide Heterogeneity:
Programms:
Salmon paper: