A snakemake-based pipeline for assembling and polishing long genomes from long nanopore reads
Reticulatus was developed in part to manage the execution of long-read mock community experiments at the Loman Lab. It turns out that it's quite good, so I've generalised it for any long-read nanopore experiment, so you too can enjoy highly-contiguous, blisteringly fast, cutting-edge assembly and polishing. Reticulatus was designed for assembly of whole genomes from metagenomic data, but we have tried it on the odd isolate too. Reticulatus is not a 16S metataxonomics analysis pipeline.
Reticulatus is not an assembler or polisher, but a well stacked set of bioinformatics blocks. Reticulatus tries to codify what we at the Loman Lab think is the current best practice for nanopore bioinformatics into a (hopefully) easy-to-use pipeline, taking advantage of all the goodness of Snakemake while adding a few features, including:
- a text-based read config that allows automated simple read pre-processing (deduplication, subsampling, merging)
- a text-based run config that provides a trivial way to define assembly and polishing strategies
- automatic generation of assembly bandage-art
- very fast GPU-accelerated polishing (racon, medaka)
- automated reporting of coverage and identity for contigs, for a set of references
As an attempted embodiment of best practice, Reticulatus is under development all of the time. Feel free to open an issue if it looks broken or send a pull request if it could work better.
Just so you know, the development of Reticulatus has:
- helped make racon even faster
- demonstrated GPU accelerated tools can work on ONT hardware and made the containers to do so, freely available
- led to a port of minidot that works with minimap2
- led to a more efficient implementation of BAM-based read subsampling in pomoxis
- pushed some minor fixes to Snakemake
git clone https://github.com/SamStudio8/reticulatus.git; cd reticulatus;
sudo apt-get install build-essential python3-dev zlib1g-dev
You'll want to build an environment from either base-gpu or base-cpu. There isn't much difference, other than that racon and medaka are absent from the GPU flavour base environment. If you're using the GPU, you'll need to install Singularity yourself for our medaka container to work.
conda env create --name reticulatus --file environments/base-gpu.yaml
conda activate reticulatus
cp Snakefile-base Snakefile
You will almost certainly want the Snakefile-base rule set for the time being. Run Snakefile-ref with an appropriate ref.cfg to replicate our mock community benchmarking pipeline.
Note: It is important that you ensure the snakemake-minimal package is installed automatically using the environment specified above. Not only is this easier, but it makes sure that the version installed is suitable for the overridden shell.py that ships with reticulatus.
cp config.yaml.example config.yaml
Replace the YAML keys as appropriate. Keys are:
Key | Type | Description |
---|---|---|
dehumanizer_database_root | Path, optional | empty directory in which to download the dehumanizer references (requires ~8.5GB), you can ignore this if you're not going to remove contigs assigned as human by kraken2 |
kraken2_database_root | Path | path to pre-built kraken2 database (i.e. the directory containing the .k2d files), or the path to a directory in which to wget a copy of our 30GB microbial database. If the database already exists, you must touch k2db.ok in this directory or bad things will happen |
ktkit_database_root | Path | path to a directory in which to wget a copy of the NCBI taxonomy dump (500 MB, tops) |
slack_token | str, optional | if you want to be bombarded with slack messages regarding the success and failure of your snakes, insert a suitable bot API token here |
slack_channel | str, optional | if using a slack_token, enter the name of the channel to send messages, including the leading # |
cuda | boolean | set to False if you do not want GPU-acceleration and True if you have the means to go very fast (i.e. you have a CUDA-compatible GPU) |
medaka_env | URI | path to a singularity image (simg) or sandbox container to run medaka (CPU or GPU) |
racon_batches | int | number of simultaneous batches to process on GPU |
polish_threads | int | number of CPU threads to use for any polishing step |
polish_gpu | int | number of GPU devices to use for any on-GPU polishing step |
assembly_threads | int | number of CPU threads to use for any assembly step |
minimap2_threads | int | number of CPU threads to use for any minimap2 step |
sort_flags | str | additional parameters to pass to any samtools sort command (e.g. to raise in-memory sort limit) |
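A quick sanity check of the configuration before starting a long run can catch the more painful mistakes in the table above. The following is a minimal sketch, not part of reticulatus: it assumes the config has already been loaded into a plain dict, it treats every key not marked optional as required (an assumption), and the function name check_config is hypothetical.

```python
import os

# Keys copied from the table above. Treating every key that is not marked
# "optional" as required is an assumption made for this sketch.
REQUIRED_KEYS = [
    "kraken2_database_root", "ktkit_database_root", "cuda", "medaka_env",
    "racon_batches", "polish_threads", "polish_gpu", "assembly_threads",
    "minimap2_threads", "sort_flags",
]


def check_config(config):
    """Return a list of human-readable problems found in a config dict."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in config]

    # The channel name must include the leading '#'.
    channel = config.get("slack_channel")
    if config.get("slack_token") and channel and not channel.startswith("#"):
        problems.append("slack_channel should include the leading #")

    # A pre-built kraken2 database (a directory with .k2d files) needs k2db.ok.
    k2_root = config.get("kraken2_database_root")
    if isinstance(k2_root, str) and os.path.isdir(k2_root):
        has_k2d = any(name.endswith(".k2d") for name in os.listdir(k2_root))
        if has_k2d and not os.path.exists(os.path.join(k2_root, "k2db.ok")):
            problems.append("kraken2 database found but k2db.ok is missing")
    return problems
```

An empty list means nothing obviously wrong was found; it does not guarantee that the pipeline will accept the config.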
cp reads.cfg.example reads.cfg
For each sample you have, add a tab delimited line with the following fields:
Key | Type | Description |
---|---|---|
sample_name | str | a unique string that can be used to refer to this sample/readset later |
ont | Path* | path to your long reads |
i0 | Path*, optional | path to your single-pair short reads for this sample, otherwise you can just set to - |
i1 | Path*, optional | path to your left paired-end short reads |
i2 | Path*, optional | path to your right paired-end short reads |
* | - | an arbitrary delimiter that has no purpose |
feel free to add your own columns for metadata here, fill your boots, reticulatus doesn't care | | |
* You can pre-process reads by modifying their file path as follows:
Option | Syntax | Description |
---|---|---|
Remove duplicates | myreads.rmdup.fq.gz | remove reads with a duplicate sequence header (to fix occasional duplicate reads arising from basecalling) |
Subset reads | myreads.subset-N.fq.gz | select a random subsample of N% (with integer N between 1 and 99) |
Merge reads | /path/to/merged/reads/:myreads.fq.gz,myotherreads.fq.gz,... | a root path for merged reads, followed by a colon and a comma delimited list of files to cat together, the filename will be chosen automatically and you should not be upset by this |
Pre-processing can be chained. For example, myreads.rmdup.subset-25.fq.gz will remove sequence name duplicates and take 25% of the result. You may also use this syntax to pre-process files for merging. Reticulatus will work out what needs to be done to generate the new read files, and will only need to do so once, even when you run the pipeline again in the future.
The processed reads will be written to the same directory as the original reads. Once this has been done, you can delete the original reads yourself, if you'd like.
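To make the chaining rule concrete, here is a minimal sketch of how a path using the .rmdup and .subset-N suffixes could be decomposed into the original file and an ordered list of steps. This is not how reticulatus itself parses paths (that code is not shown here); the function name parse_preprocessing is hypothetical, and the merge syntax is deliberately left out.

```python
import re


def parse_preprocessing(path):
    """Split a read path into (original_path, steps) using the suffix syntax.

    Steps are returned in the order they appear in the filename, as
    ("rmdup", None) or ("subset", N). Merging is out of scope for this sketch.
    """
    directory, sep, filename = path.rpartition("/")
    parts = filename.split(".")
    kept = [parts[0]]  # the first component is always the base name
    steps = []
    for part in parts[1:]:
        subset = re.fullmatch(r"subset-(\d+)", part)
        if part == "rmdup":
            steps.append(("rmdup", None))
        elif subset:
            percent = int(subset.group(1))
            if not 1 <= percent <= 99:
                raise ValueError(f"subset percentage out of range: {percent}")
            steps.append(("subset", percent))
        else:
            kept.append(part)
    return directory + sep + ".".join(kept), steps


original, steps = parse_preprocessing("myreads.rmdup.subset-25.fq.gz")
print(original, steps)
# myreads.fq.gz [('rmdup', None), ('subset', 25)]
```

The steps come back in left-to-right order, matching the description above: duplicates are removed first, then 25% of the result is sampled.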
Important: If you're using the GPU, you must ensure the directories that contain your reads are bound to the singularity container with -B in --singularity-args. Use the same path inside as outside to make things easier.
cp manifest.cfg.example manifest.cfg
For each pipe you want to run, add a tab delimited line with the following fields:
Key | Type | Description |
---|---|---|
uuid | str | a unique identifier, it can be anything, it will be used as a prefix for every file generated by this pipe, do not insert the . character here if you want things to work |
repolish | str | if you wish to reuse an assembly for a different polishing scheme, enter the corresponding uuid name here, otherwise it must be set to - |
refgroup | str | the reference set to check the assemblies and reads against, it must be a key from ref.cfg |
samplename | str | the read set to assemble and polish, it must be a key from reads.cfg |
spell | str | the "spell" to configure your assembly and polishing, corresponding to a named configuration in spellbook.py |
polishpipe | str | a minilanguage that determines the polishing strategy. Strategies are of the format <program>-<readtype>-<iterations> and are chained with the . character, e.g. racon-ont-4.medaka-ont-1.pilon-ill-1 will perform four rounds of iterative racon long-read polishing, followed by one round of medaka long-read polishing and finally one round of pilon short-read polishing. Currently the following polishers are supported: racon, medaka, pilon and dehumanizer. No polishing can be achieved by setting to - |
medakamodel | str | the option to pass to medaka_consensus -m, this corresponds to the model to use for medaka long-read polishing, it will depend on your ONT basecaller version |
feel free to add your own columns for metadata here, fill your boots, reticulatus doesn't care | | |
cpu | int, optional | override the number of available CPU cores to this limit. This is optional, but if you use the field and don't want to override a sample, you must specify - |
gpu | int, optional | override the number of available GPU interfaces to this limit. This is optional, but if you use the field and don't want to override a sample, you must specify - |
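The polishpipe minilanguage is easiest to understand by expanding a string into its stages. The sketch below is an illustration only, not the parser reticulatus uses: the function name parse_polishpipe is hypothetical, and it assumes every stage (including dehumanizer) follows the three-field <program>-<readtype>-<iterations> format described above.

```python
# Polishers listed as supported in the table above.
SUPPORTED_POLISHERS = {"racon", "medaka", "pilon", "dehumanizer"}


def parse_polishpipe(polishpipe):
    """Expand a polishpipe string into a list of (program, readtype, iterations)."""
    if polishpipe == "-":
        return []  # "-" means no polishing
    stages = []
    for stage in polishpipe.split("."):
        fields = stage.split("-")
        if len(fields) != 3:
            raise ValueError(f"expected <program>-<readtype>-<iterations>: {stage!r}")
        program, readtype, iterations = fields
        if program not in SUPPORTED_POLISHERS:
            raise ValueError(f"unsupported polisher: {program!r}")
        # int() raises ValueError if the iteration count is not a number.
        stages.append((program, readtype, int(iterations)))
    return stages


print(parse_polishpipe("racon-ont-4.medaka-ont-1.pilon-ill-1"))
# [('racon', 'ont', 4), ('medaka', 'ont', 1), ('pilon', 'ill', 1)]
```

Because stages are chained with the . character, this also shows why a . inside a uuid is a bad idea: it is already doing work as a separator elsewhere.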
Run the pipeline with snakemake. You must specify --use-conda to ensure that any tools that require a special jail (e.g. for python2) are run far, far away from everything else. Set -j to the highest number of processes that you can fill with snakes before your computer falls over.
snakemake -j <available_threads> --reason --use-conda
To activate GPU support for reticulatus, you must set the cuda key to True in config.yaml.
When invoking Snakemake you can set --resources gpu=N where N is the number of GPU interfaces you want to use. You can ignore this to use all GPU interfaces.
Currently, the GPU will accelerate the following steps:
- polish_racon: you will need a racon binary compiled with CUDA for your system. If you have multiple versions or previously installed racon to your environment, the GPU-enabled version will need to appear on your $PATH before any other installed versions of racon. You can do this by exporting it to your path after activating the conda environment for reticulatus.
- polish_medaka: you can use our singularity container defined in config.yaml, use your own, or alternatively, skip containerisation altogether and ensure medaka is appropriately installed to your $PATH.
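Since the order of entries on $PATH decides which racon gets run, it can be worth checking which binary wins before launching a GPU job. The following is a small sketch using only the Python standard library; the helper names first_on_path and prepend_to_path are hypothetical, and the example directory is a placeholder rather than a real install location.

```python
import os
import shutil


def first_on_path(program, path=None):
    """Return the first executable called `program` on a PATH-style string.

    With path=None the current $PATH is searched. Returns None if not found.
    """
    return shutil.which(program, path=path)


def prepend_to_path(directory, path):
    """Return a PATH-style string with `directory` placed ahead of `path`."""
    return os.pathsep.join([directory, path]) if path else directory


# Which racon would be picked up right now?
print(first_on_path("racon"))

# Which racon would be picked up if a (placeholder) GPU build directory came first?
gpu_first = prepend_to_path("/path/to/gpu/racon/bin", os.environ.get("PATH", ""))
print(first_on_path("racon", path=gpu_first))
```

This only reports which file would be found first; it cannot tell whether that binary was actually compiled with CUDA.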
To use singularity containers, you must specify --use-singularity and provide suitable --singularity-args to use the GPU (--nv) and bind directories (-B). You must bind the directory into which you have cloned reticulatus, as well as any other directories that contain your reads. Set dir_inside and dir_outside to the same path to ensure the file paths inside the container match those on the outside of the container.
e.g.
'--nv -B /data/sam-projects/reticulatus-testing/:/data/sam-projects/reticulatus-testing/ -B /path/to/reads/dir/:/path/to/reads/dir/ -B /path/to/more/reads/dir/:/path/to/more/reads/dir/'
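Writing out the bind list by hand gets tedious with many read directories. As a minimal sketch (the function name singularity_args is hypothetical and not part of reticulatus), the string can be assembled from a list of directories, binding each one to the same path inside the container:

```python
def singularity_args(directories, use_gpu=True):
    """Build the value to pass to --singularity-args.

    Each directory is bound to the identical path inside the container.
    Paths containing spaces are not handled by this sketch.
    """
    parts = ["--nv"] if use_gpu else []
    for directory in directories:
        parts += ["-B", f"{directory}:{directory}"]
    return " ".join(parts)


print(singularity_args([
    "/data/sam-projects/reticulatus-testing/",
    "/path/to/reads/dir/",
]))
# --nv -B /data/sam-projects/reticulatus-testing/:/data/sam-projects/reticulatus-testing/ -B /path/to/reads/dir/:/path/to/reads/dir/
```

Remember to wrap the result in single quotes on the command line so that Snakemake receives it as one argument.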
For a full invocation example:
snakemake -j <available_threads> --reason --use-conda --use-singularity --singularity-args '--nv -B <dir_inside>:<dir_outside>' -k --restart-times 1 --resources gpu=N
Unless otherwise stated by a suitable header, the files within this repository are made available under the MIT license. If you use this pipeline, an acknowledgement in your work would be nice... Don't forget to cite Snakemake.
If reticulatus has saved your computing bill, maybe buy me a beer?