chp3
luizirber committed Sep 18, 2020
1 parent 982da6c commit 7e44367
Showing 6 changed files with 235 additions and 60 deletions.
9 changes: 8 additions & 1 deletion thesis/01-scaled.Rmd
@@ -4,6 +4,13 @@ The {#rmd-basics} text after the chapter declaration will allow us to link throu

# Accurate containment estimation with Scaled MinHash sketches {#chp-scaled}

```{=latex}
\begin{epigraphs}
\qitem{The aim of science is not to open the door to infinite wisdom, but to set a limit to infinite error.}%
{Bertolt Brecht}
\end{epigraphs}
```

\chaptermark{Scaled MinHash}

## Introduction
@@ -432,7 +439,7 @@ The Rust version has four direct external dependencies:
`murmurhash3` for the MurmurHash3 hash function,
`serde_json` for JSON serialization and `structopt` for command line parsing.

Full source is available in [Appendix A](#smol-source-code).
Full source is available in Appendix [A](#smol-source-code).

#### exact

9 changes: 9 additions & 0 deletions thesis/02-index.Rmd
@@ -2,6 +2,15 @@

\chaptermark {Indexing}

```{=latex}
\begin{epigraphs}
\qitem{A complex system that works is invariably found to have evolved from a simple system that worked.
A complex system designed from scratch never works and cannot be patched up to make it work.
You have to start over with a working simple system.}%
{John Gall}
\end{epigraphs}
```

## Introduction

<!-- TODO
194 changes: 140 additions & 54 deletions thesis/03-gather.Rmd
@@ -4,19 +4,20 @@

```{=latex}
\begin{epigraphs}
\qitem{The aim of science is not to open the door to infinite wisdom, but to set a limit to infinite error.}%
{Bertolt Brecht}
\qitem{I seldom end up where I wanted to go, but almost always end up where I need to be.}%
{Douglas Adams}
\end{epigraphs}
```

## Introduction

<!--
- What are the goals of compositional analysis in biological systems?
-->

Taxonomic profiling is a particular instance of this general problem.
The goal of taxonomic profiling is to find the identity and relative abundance of microbial community elements
Compositional data analysis is the study of the parts of a whole using relative abundances [@aitchison_statistical_1982].
This is a general problem with applications across many scientific fields [@aitchison_compositional_2005],
and examples in biology include RNA-Seq [@quinn_field_2019-1],
metatranscriptomics [@macklaim_rna-seq_2018],
microbiome and metagenomics [@li_microbiome_2015].
Taxonomic profiling is a particular instance of this general problem
with the goal of finding the identity and relative abundance of microbial community elements
at a specific taxonomic rank (species, genus, family),
especially in metagenomic samples [@sczyrba_critical_2017].

@@ -30,7 +31,7 @@ $k$-mer assignments matching multiple taxons from a reference database
with an option to reduce it further to the LCA [@kim_centrifuge_2016].

Once each sequence (from raw reads or assembled contigs) has a taxonomic assignment,
these methods resolve the final identity and abundance for each member of the community by summarizing the assignments to a specific taxonomic level.
these methods resolve the final identity and abundance for each member of the community by summarizing the assignments to a specific taxonomic rank.
Taxonomic profiling is fundamentally limited by the availability of reference datasets to be used for assignments,
and reporting what percentage of the sample is unassigned is important to assess results,
especially in under characterized environments such as oceans and soil.
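
To make the LCA reduction step concrete, the minimal sketch below (in Rust; illustrative only, not code from Kraken, Centrifuge or `sourmash`, and the lineages are just examples) assigns a $k$-mer that matches two reference genomes to the deepest rank their lineages share:

```rust
/// Reduce two taxonomic lineages (ordered root -> leaf) to their lowest
/// common ancestor: the longest shared prefix of the two paths.
fn lowest_common_ancestor<'a>(a: &'a [&'a str], b: &[&str]) -> &'a [&'a str] {
    let shared = a.iter().zip(b.iter()).take_while(|(x, y)| x == y).count();
    &a[..shared]
}

fn main() {
    let e_coli = ["Bacteria", "Proteobacteria", "Gammaproteobacteria",
                  "Enterobacterales", "Enterobacteriaceae", "Escherichia"];
    let s_enterica = ["Bacteria", "Proteobacteria", "Gammaproteobacteria",
                      "Enterobacterales", "Enterobacteriaceae", "Salmonella"];
    // A k-mer present in both genomes is assigned at the family level:
    // the shared prefix ends at "Enterobacteriaceae".
    println!("{:?}", lowest_common_ancestor(&e_coli, &s_enterica));
}
```
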
@@ -197,8 +198,6 @@ knitr::include_graphics('figure/cami_i_low_recall_precision.png')

#### CAMI II mouse gut metagenome dataset

<!-- TODO more intro -->

The CAMI initiative released new challenges in 2019 (marine, high-strain and pathogen detection)
and 2020 (rhizosphere),
with updated processes for submission,
@@ -249,13 +248,73 @@ The higher the completeness and purity, and the lower the L1 norm, the better th
**c** Methods rankings and scores obtained for the different metrics over all samples and taxonomic ranks.
For score calculation, all metrics were weighted equally.

Figure \@ref(fig:gatherCAMImgSpider) is an updated version of Figure 6 from [@meyer_tutorial_2020] including `sourmash`,
comparing 10 different methods for taxonomic profiling and their characteristics at each taxonomic rank.
While previous methods show reduced completeness
(the ratio of taxa correctly identified in the ground truth)
below the genus level,
`sourmash` can reach 88.7\% completeness at the species level,
with the highest purity
(the ratio of correctly predicted taxa over all predicted taxa)
across all methods:
95.9\% when filtering predictions below 1\% abundance,
and 97\% for unfiltered results.
`sourmash` also has the lowest L1-norm error
(the sum of the absolute differences between the true and predicted abundances at
a specific taxonomic rank),
the highest number of true positives,
and the lowest number of false positives.
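
Stated more explicitly,
using true positives ($TP$, taxa correctly predicted as present),
false positives ($FP$, taxa predicted but absent from the ground truth)
and false negatives ($FN$, taxa present but not predicted) at a given taxonomic rank,
these metrics are (roughly, as used in these benchmarks):

$$\text{completeness} = \frac{TP}{TP + FN}, \qquad \text{purity} = \frac{TP}{TP + FP}, \qquad \text{L1 norm} = \sum_{t} \left| a_t - \hat{a}_t \right|$$

where $a_t$ and $\hat{a}_t$ are the true and predicted relative abundances of taxon $t$ at that rank.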

Table: (\#tab:gather-cami2) Updated Supplementary Table 12 from [@meyer_tutorial_2020].
Elapsed (wall clock) time (h:mm) and maximum resident set size
(kbytes) of taxonomic profiling methods on the 64 short-read samples of the CAMI II mouse
gut data set. The best results are shown in bold. Bracken requires running Kraken first, hence the times
for running Bracken alone and for both tools together are shown. The taxonomic profilers were run on a
computer with an Intel Xeon E5-4650 v4 CPU (virtualized to 16 CPU cores, 1 thread per core)
and 512 GB (536,870,912 kbytes) of main memory.

| Taxonomic profiler              | Time (hh:mm) | Memory (kbytes) |
|:--------------------------------|-------------:|----------------:|
| MetaPhlAn 2.9.21 | 18:44 | 5,139,172 |
| MetaPhlAn 2.2.0 | 12:30 | 1,741,304 |
| Bracken 2.5 (only Bracken) | **0:01** | **24,472** |
| Bracken 2.5 (Kraken and Bracken)| **3:03** | 39,439,796 |
| FOCUS 0.31 | 13:27 | 5,236,199 |
| CAMIARKQuikr 1.0.0 | 16:19 | 27,391,555 |
| mOTUs 1.1 | 19:50 | **1,251,296** |
| mOTUs 2.5.1 | 14:29 | 3,922,448 |
| MetaPalette 1.0.0 | 76:49 | 27,297,132 |
| TIPP 2.0.0 | 151:01 | 70,789,939 |
| MetaPhyler 1.25 | 119:30 | 2,684,720 |
| sourmash 3.4.0 | 16:41 | 5,760,922 |

When considering resource consumption and running times,
`sourmash` used 5.62 GB of memory with an _LCA index_ built from the
RefSeq snapshot (141,677 genomes) with $scaled=10000$ and $k=51$.
Each sample took 597 seconds to run (on average),
totalling 10 hours and 37 minutes for the 64 samples.
MetaPhlAn 2.9.21 was also executed on the same machine,
a workstation with an AMD Ryzen 9 3900X 12-Core CPU running at 3.80 GHz,
64 GB of DDR4 2133 MHz RAM and loading data from an NVMe SSD,
in order to compare with the times previously reported in Table \@ref(tab:gather-cami2) [@meyer_tutorial_2020].
MetaPhlAn took 11 hours and 25 minutes to run for all samples,
compared to the 18 hours and 44 minutes previously reported,
and correcting the `sourmash` running time by this factor,
it would likely take 16 hours and 41 minutes on the machine used in the original comparison.
After correction,
`sourmash` has runtime and memory consumption similar to the other best-performing tools
(_mOTUs_ and _MetaPhlAn_),
both gene-marker- and alignment-based tools.

Additionally, `sourmash` is a single-threaded program,
so it did not benefit from the 16 available CPU cores,
and it is the only tool that could use the full RefSeq snapshot,
while the other tools only scale to a smaller fraction of it
(or need custom databases).
The CAMI II RefSeq snapshot for reference genomes also doesn't include viruses;
this benefits `sourmash`,
because viral sequences are usually too small for reliable containment estimation with _Scaled MinHash_ sketches:
they require small scaled values to have enough hashes in the sketch to be reliable.

<!-- TODO: show results
main points:
- all tools used only the short reads datasets
- highest completeness and purity
- low L1-norm error (second, check this)
- for running times: run kraken2 to serve as a base and compare with running
times from the tutorial paper?
- can't really build the standard db, check with minikraken...
Expand Down Expand Up @@ -292,16 +351,13 @@ main points:
after correction: 38251 * 1.57 = 60054
-->

<!--
The CAMI II RefSeq snapshot for reference genomes doesn't include viruses;
this benefits `sourmash` because viral _Scaled MinHash_ sketches are usually not well supported for containment estimation,
since viral sequences require small scaled values to have enough hashes to be reliable.
-->

## Discussion

`gather` is a new method for decomposing datasets into its components, (...)
<!-- TODO small summary -->
`gather` is a new method for decomposing datasets into their components that
outperforms current methods when evaluated on synthetic datasets with known composition.
By leveraging _Scaled MinHash_ sketches and efficient indexing data structures,
it can scale the number of reference datasets used by over an order of magnitude when compared
to existing methods.

Other containment estimation methods described in Chapter [1](#chp-scaled),
_CMash_ [@koslicki_improving_2019] and _mash screen_ [@ondov_mash_2019],
@@ -356,36 +412,46 @@ This allows using new taxonomies derived from the same underlying datasets [@par
as well as updates to the original taxonomy used before.

<!-- Benchmarking -->
<!-- TODO
Despite the improvements to the metrics measure by CAMI benchmarks,
Benchmarking is still an unsolved problem,
despite CAMI efforts
-->
Despite improvements in the standardization and reproducibility of previous analyses,
benchmarking taxonomic profiling tools is still challenging,
since tools can generate their reference databases from multiple sources,
and choosing only one source can bias the evaluation or make it impossible to evaluate them properly.
This is especially true for real metagenomic datasets derived from samples
collected from soil and marine environments,
where the number of unknown organisms is frequently larger than the number represented in
reference databases.
With the advent of metagenome-assembled genomes (MAGs) there are more resources
available for use as reference datasets,
even if they are usually incomplete or of draft quality.
`sourmash` is well positioned to include these new references in taxonomic
profiling, given its minimal requirements (a _Scaled MinHash_ sketch of the
original dataset) and its support for indexing hundreds of thousands of datasets.


<!--
### Limitations

- Viruses (scaled minhash too small).
mash screen solves this by going for sensitivity (at the cost of precision),
possible solution: scaled+num hashes, but would only allow mash screen-like method
- gather assigns hashes to best matches ("winner-takes-all"). Other approaches
will be needed to disambiguate matches further
-> check species score from centrifuge
- abundance analysis
-> EM from centrifuge, based on cufflinks/sailfish
- Because it favors the best containment, larger organisms are more likely to be chosen first
(since they have more hashes/k-mers).
An additional step could take in consideration the size of the match too?
This is not quite the similarity, note: We still want containment, but add match size
to the calculation without necessarily using |A union B|
Scaled MinHash sketches don't preserve information about individual sequences,
and for larger scaled values and short sequences they have increasingly smaller chances of not having any of its $k$-mers (represented as hashes) contained in the final sketch.
-->
`gather` as implemented in `sourmash` has the same limitations as _Scaled MinHash_ sketches,
including reduced sensitivity to small genomes/sequences such as viruses.
_Scaled MinHash_ sketches don't preserve information about individual sequences,
and for short sequences and large scaled values there is an increasingly small chance that any of their
$k$-mers (represented as hashes) are contained in the sketch.
Because `gather` favors the best containment,
larger genomes are also more likely to be chosen first because their sketches have more elements,
and further improvements could take the size of the match into consideration too.
Note that this is not necessarily the _similarity_ $J(A, B)$ (which takes the sizes of both $A$ and $B$ into account),
but a different calculation that normalizes the containment by considering the size of the match.
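
For reference,
the usual definitions of these two quantities for sets of hashes $A$ and $B$ are

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|} \qquad \text{and} \qquad C(A, B) = \frac{|A \cap B|}{|A|},$$

where containment is asymmetric and normalizes only by the size of $A$;
the match-size-aware variant suggested above would adjust this normalization
without reverting to the full $|A \cup B|$ denominator.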

`gather` is also a greedy algorithm,
choosing the best containment match at each step.
Situations where multiple matches are equally well contained,
or where many datasets are very similar to each other,
can complicate this approach,
and additional steps must be taken to disambiguate matches.
The availability of abundance counts for each element in the _Scaled MinHash_ sketch is also not well explored,
since the process of _removing elements_ from the query doesn't account for them
(an element is removed even if its count in the query is much higher than its count in the match).
Both the multiple-match and the abundance-count issues can benefit from
existing solutions adopted by other methods,
like the _species score_ (for disambiguation) and _Expectation-Maximization_ (for abundance analysis)
approaches from Centrifuge [@kim_centrifuge_2016].

### Future directions

@@ -395,7 +461,7 @@ but it can be applied to other biological problems too.
If the query is a genome instead of a metagenome,
`gather` can be used to detect possible contamination in the assembled genome by
using a collection of genomes and removing the query genome from it (if it is present).
This allows finding matches that contain the query genome and evaluating if they agree at specific taxonomic levels,
This allows finding matches that contain the query genome and evaluating if they agree at a specific taxonomic rank,
and in case of large divergence (two different phyla are found, for example)
it is likely indicative that the query genome contains sequences from different organisms.
This is especially useful for quality control and validation of metagenome-assembled genomes (MAGs),
@@ -419,15 +485,34 @@ but show how `gather` works as a more general method that can leverage character

### Conclusion


`gather` is a new method for decomposing datasets into their components with
applications in biological sequencing data analysis (taxonomic profiling) that
can scale to hundreds of thousands of reference datasets,
with computational resource requirements that are accessible to a large number of users,
when used in conjunction with _Scaled MinHash_ sketches and efficient indices
such as the _LCA index_ and the _MHBT_.
It outperforms current methods in community-developed benchmarks,
and opens the way for new methods that explore a top-down approach for profiling
microbial communities,
including further refinements that can resolve larger evolutionary distances and
also speed up the method computationally.

## Methods

### The gather algorithm

Algorithm [1](\ref{alg:gather}) describes the `gather` method using a generic operation
`FindBestContainment`.
An implementation of `FindBestContainment` for a list of datasets is presented in
Algorithm [2](\ref{alg:list}).
Appendix [A](#smol-source-code) has a minimal implementation in the Rust programming language [@matsakis_rust_2014] for both algorithms,
including a _Scaled MinHash_ sketch implementation using a _set_ data structure from the Rust standard library (`HashSet`).

```{=latex}
\RestyleAlgo{boxruled}
\LinesNumbered
\begin{algorithm}[ht]
\label{alg:gather}
\DontPrintSemicolon
\SetKwInOut{Input}{Input}
\SetKwInOut{Output}{Output}
@@ -455,6 +540,7 @@
\end{algorithm}
\begin{algorithm}[ht]
\label{alg:list}
\DontPrintSemicolon
\SetKwInOut{Input}{Input}
\SetKwInOut{Output}{Output}
@@ -476,8 +562,8 @@
}
\KwRet{$(b, m)$}
\caption{a \emph{FindBestContainment} implementation for a list}
\label{alg:list}
\end{algorithm}
```
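
The listing below is a condensed sketch of both algorithms in Rust,
in the spirit of the `HashSet`-based implementation in Appendix [A](#smol-source-code).
All names and the example data are illustrative rather than the actual `sourmash`/`smol` API,
and datasets are scored here by raw overlap with the remaining query
(which ranks them the same way as the containment of the remaining query,
since the query size is fixed within an iteration).

```rust
use std::collections::HashSet;

/// Illustrative stand-in for a Scaled MinHash sketch: a plain set of hashes.
type Sketch = HashSet<u64>;

/// FindBestContainment over a list of datasets (Algorithm 2):
/// return the index of the dataset with the largest overlap with the query,
/// or None if no dataset reaches the threshold.
fn find_best_containment(query: &Sketch, datasets: &[Sketch], threshold: usize) -> Option<usize> {
    datasets
        .iter()
        .enumerate()
        .map(|(i, d)| (d.intersection(query).count(), i))
        .filter(|(overlap, _)| *overlap >= threshold)
        .max_by_key(|(overlap, _)| *overlap)
        .map(|(_, i)| i)
}

/// The gather decomposition (Algorithm 1): repeatedly pick the best remaining
/// match and remove its hashes from the query ("winner-takes-all"),
/// until the query is exhausted or no match reaches the threshold.
/// `threshold` must be at least 1 to guarantee termination.
fn gather(mut query: Sketch, datasets: &[Sketch], threshold: usize) -> Vec<usize> {
    let mut matches = Vec::new();
    while let Some(best) = find_best_containment(&query, datasets, threshold) {
        matches.push(best);
        // Hashes explained by this match no longer count for later matches.
        query.retain(|hash| !datasets[best].contains(hash));
        if query.is_empty() {
            break;
        }
    }
    matches
}

fn main() {
    let query = Sketch::from([1, 2, 3, 4, 5, 6]);
    let datasets = vec![
        Sketch::from([1, 2, 3, 4]), // best first match (overlap 4)
        Sketch::from([3, 4, 5]),    // only {5} remains after the first match
        Sketch::from([7, 8]),       // never overlaps the query
    ];
    // Prints [0, 1]; hash 6 stays unassigned, analogous to the unassigned
    // fraction reported for a metagenome.
    println!("{:?}", gather(query, &datasets, 1));
}
```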

### Implementation

@@ -487,7 +573,7 @@ but show how `gather` works as a more general method that can leverage character
This ABC declares methods that any index in `sourmash` has to implement,
but each index is allowed to implement them in the most efficient or performant way:

1. For `Linear` indices, the `FindBestContainment` operation is implemented as a linear scan over the list of signatures (Algorithm \@ref(alg:list));
1. For `Linear` indices, the `FindBestContainment` operation is implemented as a linear scan over the list of signatures (Algorithm [2](\ref{alg:list}));

2. For `MHBT` indices, `FindBestContainment` is implemented as a depth-first search that tracks the best containment found,
and prunes the search if the current estimated containment in an internal node is lower than the current best containment.
2 changes: 1 addition & 1 deletion thesis/04-distributed.Rmd
@@ -314,7 +314,7 @@ including geographical distribution and pangenomics analysis.

Executing the search took 11 hours to run,
using less than 5GB of RAM and 24 threads.
The machine used was a workstation with a AMD Ryzen 9 3900X 12-Core CPU running at 3.80 GHz,
The machine used was a workstation with an AMD Ryzen 9 3900X 12-Core CPU running at 3.80 GHz,
64 GB DDR4 2133 MHz of RAM and loading data from an external HDD enclosure connected via USB.
If there is access to additional machines (in a cluster, for example)
it is also possible to parallelize the search even further by splitting the metagenomes into groups,
4 changes: 0 additions & 4 deletions thesis/05-decentralized.Rmd
@@ -4,10 +4,6 @@

```{=latex}
\begin{epigraphs}
\qitem{A complex system that works is invariably found to have evolved from a simple system that worked.
A complex system designed from scratch never works and cannot be patched up to make it work.
You have to start over with a working simple system.}%
{John Gall}
\qitem{Another flaw in the human character is that everybody wants to build and nobody wants to do maintenance.}%
{Kurt Vonnegut}
\end{epigraphs}
