1. Pipeline Overview
[Insert Flowchart]
Read quality is evaluated and managed using FastQC and fastp. Please submit a feature request if there are any additional QC steps you would like added (e.g., dehosting, PhiX removal).
VAPER creates genome assemblies using a reference. This means that an appropriate reference must be supplied for each species/subtype that you plan to assemble. References can be supplied manually or selected automatically from a reference set.
VAPER comes with a default set of reference genomes for multiple viral species (see table below; files located in `vaper/assets/reference_sets/`). Alternative reference sets can be supplied using the `--refs` parameter. These reference sets were created using EPITOME and aim to capture the diversity of each species at intervals of 5% or greater sequence divergence. This divergence threshold is based on work conducted at WAPHL using varcraft, which demonstrated that optimum read mapping (BWA MEM) is achieved when sample-reference divergence is ≤ 5%. Please submit a feature request if there are any additional species you would like to add to this list!
Taxon | Segments | Input Sequences | Data Source |
---|---|---|---|
Influenza A | 1-8 | 78703 (per segment) | GISAID |
Influenza B | 1-8 | 17401 (per segment) | GISAID |
Measles morbillivirus | wg | 890 | NCBI |
Mumps orthorubulavirus | wg | 1343 | NCBI |
Lyssavirus rabies | wg | 2607 | NCBI |
Norovirus | wg | 1662 | NCBI |
Respiratory Syncytial Virus | wg | 15273 | GISAID |
West Nile virus | wg | 1993 | NCBI |
Enterovirus D68 | wg | 590 | NCBI |
Hepacivirus | wg | 1245 | NCBI |
Hepatovirus | wg | 131 | NCBI |
Monkeypox virus | wg | 2129 | NCBI |
Severe acute respiratory syndrome coronavirus | wg | 2000 (random) | NCBI |
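To make the 5% divergence threshold concrete, here is a minimal Python sketch that computes the proportion of mismatched sites between two sequences that have already been aligned to each other. The function name `divergence` and the choice to skip gap and `N` positions are assumptions made for illustration; this is not how EPITOME or varcraft measure divergence.

```python
def divergence(aligned_a, aligned_b):
    """Proportion of mismatches between two pre-aligned, equal-length sequences.

    Assumption: positions where either sequence has a gap ('-') or 'N' are skipped.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    mismatches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a in "-N" or b in "-N":
            continue
        compared += 1
        if a != b:
            mismatches += 1
    return mismatches / compared if compared else float("nan")


# Hypothetical sample and reference fragments (one mismatch in ten sites).
sample = "ACGTACGTAC"
reference = "ACGTACGTAA"
d = divergence(sample, reference)
print(f"divergence = {d:.2%}; within 5% threshold: {d <= 0.05}")
```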
Reference genomes can be supplied to VAPER individually or as a set. In either case, all references must be gzip compressed.
You can tell VAPER which reference(s) to use for each sample by supplying the path(s) to the reference(s) in the samplesheet or by supplying the name of the reference in the reference set. Multiple references can be supplied per sample by separating them with a semicolon. See the example below:
samplesheet.csv:
sample,fastq_1,fastq_2,reference
sample01,sample01_R1.fastq.gz,sample01_R2.fastq.gz,/home/viruses/refs/Influenza_A_NA-1.fa.gz;/home/viruses/refs/Influenza_A_HA-1.fa.gz
sample02,sample02_R1.fastq.gz,sample02_R2.fastq.gz,Influenza_A_NA-1;Influenza_A_HA-1
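As a rough illustration of how the `reference` column can be interpreted, the Python sketch below splits each entry on semicolons and guesses whether an entry is a file path or a reference name. The helper names `parse_samplesheet` and `is_path`, and the path heuristic itself, are assumptions for this example and do not describe VAPER's internal logic.

```python
import csv
import io


def parse_samplesheet(handle):
    """Return {sample: [reference, ...]} from a samplesheet with a 'reference' column."""
    refs_by_sample = {}
    for row in csv.DictReader(handle):
        # Multiple references are separated by semicolons; empty entries are dropped.
        entries = (row.get("reference") or "").split(";")
        refs_by_sample[row["sample"]] = [e.strip() for e in entries if e.strip()]
    return refs_by_sample


def is_path(ref):
    """Heuristic (assumption): entries with a slash or a .gz suffix are treated as paths."""
    return "/" in ref or ref.endswith(".gz")


example = """sample,fastq_1,fastq_2,reference
sample01,sample01_R1.fastq.gz,sample01_R2.fastq.gz,/home/viruses/refs/Influenza_A_NA-1.fa.gz;/home/viruses/refs/Influenza_A_HA-1.fa.gz
sample02,sample02_R1.fastq.gz,sample02_R2.fastq.gz,Influenza_A_NA-1;Influenza_A_HA-1
"""

parsed = parse_samplesheet(io.StringIO(example))
for sample, refs in parsed.items():
    for ref in refs:
        print(sample, ref, "path" if is_path(ref) else "name")
```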
Reference sets are used by VAPER for automated reference selection. These can be supplied as a comma-separated samplesheet containing the absolute paths to each reference, or as a tar.gz archive that contains both the samplesheet and the reference files (the reference files must be in a directory called 'references/'). Using a tar file is the preferred method when working with large reference sets. Below is an example of how to prepare the reference samplesheet:
refsheet.csv:
taxa,segment,assembly
Influenza_A,HA,Influenza_A_HA.fa.gz
Monkeypox_virus,wg,Monkeypox_virus_wg.fa.gz
Note
An example of how to set up the tar file can be found here
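For orientation only, the following Python sketch writes a reference samplesheet and packs it into a tar.gz archive with the reference files under `references/`. The function name `build_reference_tar`, the samplesheet file name `refsheet.csv`, its placement at the top level of the archive, and the use of bare file names in the `assembly` column are all assumptions; check the linked example for the layout VAPER actually expects.

```python
import csv
import tarfile
from pathlib import Path


def build_reference_tar(rows, ref_dir, out_tar, sheet_name="refsheet.csv"):
    """Write a refsheet and bundle it with the reference files.

    rows: list of (taxa, segment, assembly_filename) tuples.
    ref_dir: directory that holds the gzip-compressed reference files.
    Assumed layout: <sheet_name> at the archive root, files under references/.
    """
    ref_dir = Path(ref_dir)
    out_tar = Path(out_tar)
    sheet_path = out_tar.with_name(sheet_name)
    with open(sheet_path, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["taxa", "segment", "assembly"])
        writer.writerows(rows)
    with tarfile.open(out_tar, "w:gz") as tar:
        tar.add(sheet_path, arcname=sheet_name)
        for _taxa, _segment, assembly in rows:
            tar.add(ref_dir / assembly, arcname=f"references/{assembly}")
    return out_tar


# Example call (paths are hypothetical):
# build_reference_tar(
#     [("Influenza_A", "HA", "Influenza_A_HA.fa.gz"),
#      ("Monkeypox_virus", "wg", "Monkeypox_virus_wg.fa.gz")],
#     ref_dir="my_refs",
#     out_tar="my_refset.tar.gz",
# )
```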
VAPER can automatically select the "best" reference(s) for each sample from a supplied reference set. This can be performed using the `fast` or `accurate` mode. References within a set should differ from one another by at least 5% at the nucleotide level; otherwise, multiple references may be selected for a single organism. By default, a reference is only selected if at least 10% of it is detected in the sample. This can be adjusted using the `--ref_genfrac` parameter.
As the name implies, reference selection using the `accurate` mode is more accurate but slower. References are selected by mapping contigs from a de novo assembly to the entire reference set using `minimap2 -x asm5 --secondary=no`. The `-x asm5` flag means that contigs will only map to references that share approximately 95% nucleotide identity. The `--secondary=no` flag means that contigs will only map to the closest matching reference (no multi-mapping). Together, these parameters allow VAPER to choose the reference(s) that best match the sample. This process still has room for improvement. Tweaking the parameters below may improve results if you run into any issues:
- `--ref_genfrac`: controls the minimum percent of a reference that must be mapped by one or more de novo contigs for it to be selected for consensus generation.
- `--denovo_assembler`: controls which tool is used for de novo assembly. Options include `megahit`, `spades`, `velvet`, and `skesa`.
- `--denovo_contigcov`: controls the minimum coverage required for a contig to be included in the de novo assembly.
- `--denovo_contiglen`: controls the minimum contig length required for a contig to be included in the de novo assembly.
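The genome-fraction filter described above can be sketched in Python as follows. The example reads minimap2 PAF records (tab-separated, with the target name, length, start, and end in columns 6 to 9), merges overlapping aligned intervals per reference, and keeps references whose covered fraction meets a threshold. The function names and the use of a 0 to 1 fraction for the threshold are assumptions for illustration; VAPER's own implementation may differ.

```python
from collections import defaultdict


def genome_fraction(paf_lines):
    """Return {reference: fraction of its bases covered by at least one contig alignment}."""
    intervals = defaultdict(list)
    lengths = {}
    for line in paf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # not a complete PAF record
        target = fields[5]
        lengths[target] = int(fields[6])
        # PAF target coordinates are 0-based with an exclusive end.
        intervals[target].append((int(fields[7]), int(fields[8])))

    fractions = {}
    for target, spans in intervals.items():
        covered = 0
        cur_start, cur_end = None, None
        for start, end in sorted(spans):
            if cur_end is None or start > cur_end:
                # Close the previous merged interval and open a new one.
                if cur_end is not None:
                    covered += cur_end - cur_start
                cur_start, cur_end = start, end
            else:
                cur_end = max(cur_end, end)
        covered += cur_end - cur_start
        fractions[target] = covered / lengths[target]
    return fractions


def select_references(fractions, min_fraction=0.10):
    """Keep references whose covered fraction is at least min_fraction (hypothetical 0-1 scale)."""
    return sorted(ref for ref, frac in fractions.items() if frac >= min_fraction)


# Hypothetical alignments: two overlapping contigs on refA, one short hit on refB.
paf = [
    "contig1\t500\t0\t500\t+\trefA\t1000\t0\t500\t480\t500\t60",
    "contig2\t400\t0\t400\t+\trefA\t1000\t300\t700\t390\t400\t60",
    "contig3\t50\t0\t50\t+\trefB\t2000\t100\t150\t50\t50\t60",
]
fractions = genome_fraction(paf)
print(fractions)
print(select_references(fractions))  # ['refA']
```

In this toy input, refA is 70% covered and is kept, while refB is 2.5% covered and falls below the 10% default.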
Also aptly named, `fast` mode is faster but less accurate (how much faster is up for debate). This approach uses `sourmash gather` to determine which reference(s) in the reference set are best represented in the raw reads. This mode has not been thoroughly tested. You can adjust this primarily using the `--ref_genfrac` parameter.
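The idea behind detecting a reference in raw reads can be illustrated with exact k-mer containment: the share of a reference's k-mers that also occur in the reads. The Python sketch below is a simplification and not sourmash's algorithm; sourmash works on FracMinHash sketches rather than full k-mer sets. All function names here are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    revcomp = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, revcomp)


def kmers(seq, k=21):
    """Set of canonical k-mers in a sequence (empty if the sequence is shorter than k)."""
    seq = seq.upper()
    return {canonical(seq[i:i + k]) for i in range(len(seq) - k + 1)}


def containment(reference, reads, k=21):
    """Fraction of the reference's k-mers that are present in the reads."""
    ref_kmers = kmers(reference, k)
    if not ref_kmers:
        return 0.0
    read_kmers = set()
    for read in reads:
        read_kmers |= kmers(read, k)
    return len(ref_kmers & read_kmers) / len(ref_kmers)


# Toy example with a small k so the arithmetic is easy to follow.
print(containment("ACGTACGTAA", ["ACGTACG"], k=4))  # 0.75
```

A reference would then be kept when its containment meets a chosen threshold, analogous to the genome-fraction cutoff.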
VAPER has the option to include reference assemblies from GenBank for each species reported in the metagenomic analysis (`--ref_kitchensink true`). It is very likely that these assemblies will not be high quality, but it provides a quick-and-dirty option to capture species that are absent from your reference set.
Genome assemblies can be created using either iVar or IRMA (default: `--cons_assembler ivar`).
iVar is the default assembler used by VAPER. Reads are aligned to the reference genome using BWA MEM and the alignment pileup is passed to iVar for evaluation of nucleotide quality and identity.
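To give a feel for what evaluating a pileup involves, here is a deliberately simplified Python sketch of consensus calling: low-quality bases are ignored, positions with too little depth become `N`, and the most frequent base is called when it is frequent enough. The function name, the input structure, and the default thresholds are assumptions for illustration; iVar's actual rules and defaults differ and are documented separately.

```python
from collections import Counter


def call_consensus(pileup, min_depth=10, min_freq=0.5, min_qual=20):
    """Call one consensus base per pileup column.

    pileup: list of columns; each column is a list of (base, phred_quality) tuples.
    All thresholds are hypothetical defaults chosen for this sketch.
    """
    consensus = []
    for column in pileup:
        # Count only bases that pass the quality filter.
        counts = Counter(base for base, qual in column if qual >= min_qual)
        depth = sum(counts.values())
        if depth == 0 or depth < min_depth:
            consensus.append("N")
            continue
        base, count = counts.most_common(1)[0]
        consensus.append(base if count / depth >= min_freq else "N")
    return "".join(consensus)


# Four hypothetical columns: well supported, too shallow, mixed, low quality.
columns = [
    [("A", 30)] * 10,
    [("A", 30)] * 3,
    [("A", 30)] * 6 + [("G", 30)] * 4,
    [("A", 5)] * 20,
]
print(call_consensus(columns))  # ANAN
```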
IRMA is a CDC-developed viral assembler with some nifty features 🌼. The main selling point with IRMA is that it can iteratively adjust the reference genome to more closely match the sample reads, resulting in a more "accurate" assembly. This is accomplished using a more forgiving initial read alignment approach that allows for greater sample-reference divergence. IRMA can also elongate the reference during the refinement process, ideally resulting in a more "complete" assembly (`--cons_elong true`; use with caution).
Caution
We have observed a decrease in assembly accuracy when using IRMA with large read sets (high read depth). Although the cause remains unclear, we suspect that errors are introduced during the reference modification process because the liberal initial read alignment steps include off-target reads. To combat this, you can now randomly subsample your reads to a specific depth using the `--max_reads` parameter.
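Random subsampling to a fixed number of reads can be done in a single pass with reservoir sampling, as in the Python sketch below. This only illustrates the general technique; the source does not state how VAPER performs the subsampling or whether `--max_reads` counts individual reads or read pairs, so the function name and the pair-based counting are assumptions.

```python
import random


def subsample_pairs(pairs, max_reads, seed=0):
    """Reservoir-sample up to max_reads items from an iterable of read pairs.

    Each item is kept with equal probability, and the input is read only once,
    so the full read set never has to be held in memory.
    """
    rng = random.Random(seed)
    reservoir = []
    for i, pair in enumerate(pairs):
        if i < max_reads:
            reservoir.append(pair)
        else:
            # Replace an existing entry with probability max_reads / (i + 1).
            j = rng.randint(0, i)
            if j < max_reads:
                reservoir[j] = pair
    return reservoir


# Hypothetical read pairs identified by name only.
pairs = [(f"read{i}/1", f"read{i}/2") for i in range(100)]
subset = subsample_pairs(pairs, max_reads=10, seed=1)
print(len(subset))  # 10
```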
Important
The default assembly generated by IRMA is a `plurality` consensus. This is quite different from what is produced by traditional assemblers. That said, IRMA also outputs an `amended` consensus that more closely aligns with the traditional approach. This `amended` consensus is the default output returned by VAPER. The third option is to return a `padded` consensus, which does not use iterative reference refinement but can provide insight into potential amplicon dropouts. You can change which assembly is returned using the `--cons_type` parameter. Read more about each approach here.
Assembly quality is evaluated using Nextclade, along with some custom scripts. Quality metrics are reported relative to the reference genome used to create the assembly.
VAPER performs a basic viral metagenomic analysis using `sourmash gather` and `sourmash tax metagenome` with the 21-mer viral GenBank database. You can supply alternative database files using `--sm_db` and `--sm_taxa`.
A summary of the results can be found at `${params.outdir}/VAPER-summary.csv`.