1. Pipeline Overview

Jared Johnson edited this page Dec 16, 2024 · 33 revisions

[Insert Flowchart]

Read Quality

Read quality is evaluated and managed using FastQC and fastp. Please submit a feature request if there are any additional QC steps you would like added (e.g., dehosting, PhiX removal, etc.).

Reference Genomes

VAPER creates genome assemblies using a reference. This means that an appropriate reference must be supplied for each species/subtype that you plan to assemble. References can be supplied manually or selected automatically from a reference set.

Using the default reference genomes

VAPER comes with a default set of reference genomes for multiple viral species (see table below; files located in vaper/assets/reference_sets/). Alternative reference sets can be supplied using the --refs parameter. These reference sets were created using EPITOME and aim to capture the diversity of each species at intervals of 5% or greater sequence divergence. This divergence threshold is based on work conducted at WAPHL using varcraft, which demonstrated that optimum read mapping (BWA MEM) is achieved when sample-reference divergence is ≤ 5%. Please submit a feature request if there are any additional species you would like to add to this list!

| Taxon | Segments | Input Sequences | Data Source |
|---|---|---|---|
| Influenza A | 1-8 | 78703 (per segment) | GISAID |
| Influenza B | 1-8 | 17401 (per segment) | GISAID |
| Measles morbillivirus | wg | 890 | NCBI |
| Mumps orthorubulavirus | wg | 1343 | NCBI |
| Lyssavirus rabies | wg | 2607 | NCBI |
| Norovirus | wg | 1662 | NCBI |
| Respiratory Syncytial Virus | wg | 15273 | GISAID |
| West Nile virus | wg | 1993 | NCBI |
| Enterovirus D68 | wg | 590 | NCBI |
| Hepacivirus | wg | 1245 | NCBI |
| Hepatovirus | wg | 131 | NCBI |
| Monkeypox virus | wg | 2129 | NCBI |
| Severe acute respiratory syndrome coronavirus | wg | 2000 (random) | NCBI |

Supplying your own reference genomes

Reference genomes can be supplied to VAPER individually or as a set. In either case, all references must be gzip-compressed.

Supplying individual references

You can tell VAPER which reference(s) to use for each sample by supplying, in the samplesheet, either the path(s) to the reference file(s) or the name(s) of the reference(s) in the reference set. To supply multiple references for one sample, separate them with semicolons. See the example below:

samplesheet.csv:

sample,fastq_1,fastq_2,reference
sample01,sample01_R1.fastq.gz,sample01_R2.fastq.gz,/home/viruses/refs/Influenza_A_NA-1.fa.gz;/home/viruses/refs/Influenza_A_HA-1.fa.gz
sample02,sample02_R1.fastq.gz,sample02_R2.fastq.gz,Influenza_A_NA-1;Influenza_A_HA-1
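If you generate samplesheets programmatically, the semicolon-joined reference column can be produced as in the minimal sketch below. The helper name `build_samplesheet` and the dictionary layout are hypothetical and not part of VAPER; only the column names and the semicolon separator come from the example above.

```python
import csv
import io

def build_samplesheet(samples):
    """Render sample records as samplesheet CSV text.

    Hypothetical helper (not part of VAPER). Each record lists its
    references separately; they are joined with ';' in one column.
    """
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["sample", "fastq_1", "fastq_2", "reference"])
    for s in samples:
        writer.writerow(
            [s["sample"], s["fastq_1"], s["fastq_2"], ";".join(s["references"])]
        )
    return buf.getvalue()

samples = [
    {
        "sample": "sample01",
        "fastq_1": "sample01_R1.fastq.gz",
        "fastq_2": "sample01_R2.fastq.gz",
        # References given as paths to gzip-compressed FASTA files
        "references": [
            "/home/viruses/refs/Influenza_A_NA-1.fa.gz",
            "/home/viruses/refs/Influenza_A_HA-1.fa.gz",
        ],
    },
    {
        "sample": "sample02",
        "fastq_1": "sample02_R1.fastq.gz",
        "fastq_2": "sample02_R2.fastq.gz",
        # References given by name from the reference set
        "references": ["Influenza_A_NA-1", "Influenza_A_HA-1"],
    },
]

samplesheet_text = build_samplesheet(samples)
print(samplesheet_text, end="")
```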

Supplying reference sets

Reference sets are used by VAPER for automated reference selection. These can be supplied as a comma-separated samplesheet containing the absolute paths to each reference, or as a gzip-compressed tar archive (.tar.gz) that contains both the samplesheet and the reference files (the reference files must be in a directory called 'references/'). Using a tar archive is the preferred method when working with large reference sets. Below is an example of how to prepare the reference samplesheet:

refsheet.csv

taxa,segment,assembly
Influenza_A,HA,Influenza_A_HA.fa.gz
Monkeypox_virus,wg,Monkeypox_virus_wg.fa.gz

Note

An example of how to set up the tar file can be found here
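For orientation, here is a rough sketch of packaging a reference set with Python's standard library. Only the 'references/' directory name, the gzip requirement, and the refsheet columns come from this page. The helper name `build_reference_set`, the archive name `refset.tar.gz`, and placing `refsheet.csv` at the archive root are assumptions, so check the linked example for the layout VAPER actually expects.

```python
import csv
import gzip
import tarfile
import tempfile
from pathlib import Path

def build_reference_set(workdir, refs, sheet_name="refsheet.csv",
                        tar_name="refset.tar.gz"):
    """Write gzip-compressed FASTA files, a refsheet, and a .tar.gz archive.

    Hypothetical helper (not part of VAPER). `refs` is a list of
    (taxa, segment, fasta_text) tuples. File names and the position of
    the refsheet inside the archive are assumptions.
    """
    workdir = Path(workdir)
    ref_dir = workdir / "references"
    ref_dir.mkdir(parents=True, exist_ok=True)
    sheet = workdir / sheet_name
    with open(sheet, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["taxa", "segment", "assembly"])
        for taxa, segment, fasta_text in refs:
            fname = f"{taxa}_{segment}.fa.gz"
            # All references must be gzip-compressed
            with gzip.open(ref_dir / fname, "wt") as fasta:
                fasta.write(fasta_text)
            writer.writerow([taxa, segment, fname])
    tar_path = workdir / tar_name
    with tarfile.open(tar_path, "w:gz") as tar:
        tar.add(sheet, arcname=sheet_name)
        tar.add(ref_dir, arcname="references")
    return tar_path

# Toy sequences only; real references would be full genomes or segments.
toy_refs = [
    ("Influenza_A", "HA", ">Influenza_A_HA\nACGT\n"),
    ("Monkeypox_virus", "wg", ">Monkeypox_virus_wg\nACGT\n"),
]
with tempfile.TemporaryDirectory() as tmp:
    tar_path = build_reference_set(tmp, toy_refs)
    with tarfile.open(tar_path) as tar:
        members = sorted(tar.getnames())
print(members)
```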

Automatic Reference Selection

VAPER can automatically select the "best" reference(s) for each sample from a supplied reference set. This can be performed using the fast or accurate mode. References within a set should differ from one another by at least 5% at the nucleotide level (i.e., share no more than 95% identity); otherwise, multiple references will be selected for a single organism. By default, a reference is only selected if at least 10% of it is detected in the sample. This can be adjusted using the --ref_genfrac parameter.

Accurate Mode

As the name implies, reference selection using the accurate mode is more accurate but slower. References are selected by mapping contigs from a de novo assembly to the entire reference set using minimap2 -x asm5 --secondary=no. The -x asm5 flag means that contigs will only map to references with which they share at least approx. 95% nucleotide identity. The --secondary=no flag means that contigs will only map to the closest matching reference (no multi-mapping). Together, these parameters allow VAPER to choose the reference(s) that best match the sample. This process still has room for improvement. Tweaking the parameters below may improve results if you run into any issues:

  • --ref_genfrac: controls the minimum percent of a reference that must be mapped by one or more de novo contigs for it to be selected for consensus generation.
  • --denovo_assembler: controls which tool is used for de novo assembly. Options include megahit, spades, velvet, and skesa.
  • --denovo_contigcov: controls the minimum coverage required for a contig to be included in the de novo assembly.
  • --denovo_contiglen: controls the minimum contig length required for a contig to be included in the de novo assembly.
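To make the idea behind --ref_genfrac concrete, the sketch below computes the fraction of each reference covered by contig alignments from minimap2 PAF lines and keeps references that meet a threshold. This illustrates the concept only and is not VAPER's implementation. The function names are hypothetical, and expressing the threshold as a fraction (0.10 for 10%) is an assumption about scale.

```python
from collections import defaultdict

def genome_fraction(paf_lines):
    """Return {reference: fraction of its length covered by >= 1 alignment}.

    Uses the standard PAF columns: target name (6), target length (7),
    target start (8), and target end (9); coordinates are 0-based, half-open.
    """
    intervals = defaultdict(list)
    lengths = {}
    for line in paf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip malformed lines
        target = fields[5]
        lengths[target] = int(fields[6])
        intervals[target].append((int(fields[7]), int(fields[8])))

    fractions = {}
    for target, spans in intervals.items():
        covered = 0
        cur_start = cur_end = None
        # Merge overlapping alignments so no base is counted twice
        for start, end in sorted(spans):
            if cur_end is None or start > cur_end:
                if cur_end is not None:
                    covered += cur_end - cur_start
                cur_start, cur_end = start, end
            else:
                cur_end = max(cur_end, end)
        covered += cur_end - cur_start
        fractions[target] = covered / lengths[target]
    return fractions

def select_references(fractions, min_genfrac=0.10):
    """Keep references whose covered fraction meets the threshold."""
    return sorted(ref for ref, frac in fractions.items() if frac >= min_genfrac)

# Made-up alignments: two overlapping contigs on HA, one short hit on NA
paf = [
    "contig1\t900\t0\t900\t+\tInfluenza_A_HA-1\t1700\t100\t1000\t880\t900\t60",
    "contig2\t500\t0\t500\t+\tInfluenza_A_HA-1\t1700\t800\t1300\t490\t500\t60",
    "contig3\t100\t0\t100\t-\tInfluenza_A_NA-1\t1400\t0\t100\t98\t100\t60",
]
fractions = genome_fraction(paf)
selected = select_references(fractions)
print(selected)
```

In this toy input the two HA contigs overlap, so their merged span (positions 100 to 1300) covers about 71% of the reference, while the single NA hit covers about 7% and falls below the 10% default.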

Fast Mode

Also aptly named, fast mode is faster but less accurate (how much faster is up for debate). This approach uses sourmash gather to determine which reference(s) in the reference set are best represented in the raw reads. This mode has not been thoroughly tested. You can adjust this primarily using the --ref_genfrac parameter.
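The intuition behind this mode can be shown with exact k-mer containment: what fraction of a reference's k-mers also appear in the reads? The sketch below is a deliberately simplified stand-in. sourmash works on FracMinHash sketches and canonical (strand-independent) k-mers rather than full k-mer sets, and the function names and the k=5 toy value are chosen here for illustration only.

```python
def kmers(seq, k):
    """Return the set of all k-mers in a sequence (forward strand only)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def containment(reference, reads, k):
    """Fraction of the reference's k-mers that are found in the reads."""
    ref_kmers = kmers(reference, k)
    if not ref_kmers:
        return 0.0
    read_kmers = set()
    for read in reads:
        read_kmers |= kmers(read, k)
    return len(ref_kmers & read_kmers) / len(ref_kmers)

# Toy data: two "reads" that together cover most of a short reference
reference = "ACGTTGCAAGGCTTAACCGGATCG"
reads = [reference[:12], reference[8:20]]
frac = containment(reference, reads, k=5)

# A reference would pass a 10% threshold if frac >= 0.10
print(frac, frac >= 0.10)
```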

Kitchen Sink Mode 🚽

VAPER has the option to include reference assemblies from GenBank for each species reported in the metagenomic analysis (--ref_kitchensink true). It is very likely that these assemblies will not be high quality, but it provides a quick-and-dirty option to capture species that are absent from your reference set. ⚠️ Use this feature with caution.

Genome Assembly

Genome assemblies can be created using either iVar or IRMA (default: --cons_assembler ivar).

iVar

iVar is the default assembler used by VAPER. Reads are aligned to the reference genome using BWA MEM and the alignment pileup is passed to iVar for evaluation of nucleotide quality and identity.

IRMA

IRMA is a CDC-developed viral assembler with some nifty features 🌼. The main selling point with IRMA is that it can iteratively adjust the reference genome to more closely match the sample reads, therefore resulting in a more "accurate" assembly. This is accomplished using a more forgiving initial read alignment approach that allows for greater sample-reference divergence. IRMA can also elongate the reference during the refinement process, ideally resulting in a more "complete" assembly (--cons_elong true; use with caution ⚠️). Out of the box, IRMA is limited to a select number of species: influenza, ebolavirus, and coronavirus. This is because IRMA requires species-specific "modules" that contain a defined set of reference genomes. We can get around this by building these modules on the fly using the VAPER references.

Caution

We have observed a decrease in assembly accuracy when using IRMA with large read sets (high read depth). The cause remains unclear, but we suspect that the liberal initial read alignment steps include off-target reads, which then introduce errors during the reference modification process. To combat this, you can now randomly subsample your reads to a specific depth using the --max_reads parameter.
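Random subsampling to a fixed number of reads can be done in a single pass with reservoir sampling. The sketch below shows that general technique on in-memory FASTQ lines. It is not necessarily how VAPER implements --max_reads, and the function names are hypothetical.

```python
import random

def fastq_records(lines):
    """Group FASTQ lines into 4-line records (a trailing partial record is dropped)."""
    return list(zip(*[iter(lines)] * 4))

def reservoir_sample(records, max_reads, seed=0):
    """Keep a uniform random sample of at most `max_reads` records (Algorithm R)."""
    rng = random.Random(seed)
    kept = []
    for i, record in enumerate(records):
        if i < max_reads:
            kept.append(record)
        else:
            # Replace a kept record with probability max_reads / (i + 1)
            j = rng.randint(0, i)
            if j < max_reads:
                kept[j] = record
    return kept

# Ten toy reads, subsampled to three
lines = []
for n in range(10):
    lines += [f"@read{n}", "ACGT", "+", "IIII"]
all_reads = fastq_records(lines)
subset = reservoir_sample(all_reads, max_reads=3)
print([record[0] for record in subset])
```

For paired-end data, sample pairs rather than individual reads (for example, by passing tuples of the R1 and R2 records) so that mates stay together.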

Important

The default assembly generated by IRMA is a plurality consensus. This is quite different from what is produced by traditional assemblers. That said, IRMA also outputs an amended consensus that more closely aligns with the traditional approach. This amended consensus is the default output returned by VAPER. The third option is to return a padded consensus, which does not use iterative reference refinement but can provide insight into potential amplicon dropouts. You can change which assembly is returned using the --cons_type parameter. Read more about each approach here.
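As a rough illustration of what "plurality" means in general, the sketch below takes the most frequent base in each column of a set of pre-aligned reads. It is a toy and not IRMA's algorithm: gaps are simply ignored, ties go to the base seen first, and nothing here reflects how IRMA derives its amended or padded consensus.

```python
from collections import Counter

def plurality_consensus(aligned_reads, gap="-"):
    """Return the most frequent non-gap base per column ('N' if a column is empty)."""
    length = max(len(read) for read in aligned_reads)
    consensus = []
    for i in range(length):
        column = [read[i] for read in aligned_reads if i < len(read) and read[i] != gap]
        if column:
            consensus.append(Counter(column).most_common(1)[0][0])
        else:
            consensus.append("N")
    return "".join(consensus)

# Four toy reads already aligned to the same coordinates
aligned = ["ACGT", "ACGA", "ACTA", "-CGA"]
toy_consensus = plurality_consensus(aligned)
print(toy_consensus)
```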

Quality control

Assembly quality is evaluated using Nextclade, along with some custom scripts. Quality metrics are reported relative to the reference genome used to create the assembly.

Metagenomic Classification

VAPER performs a basic viral metagenomic analysis using sourmash gather and sourmash tax metagenome with the 21-mer viral GenBank database. You can supply alternative database files using --sm_db and --sm_taxa.

Run Summary

A summary of the results can be found at ${params.outdir}/VAPER-summary.csv.