Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update Module2_DAY1_Tutorial.md #2

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
67 changes: 37 additions & 30 deletions Module2_DAY1_Tutorial.md
Original file line number Diff line number Diff line change
Expand Up @@ -9,11 +9,11 @@ home: https://bioinformaticsdotca.github.io/MIC_2021
description: MIC 2021 Module 2 Lab
---

This lab is based on the [16S rRNA gene analysis](https://link.springer.com/protocol/10.1007%2F978-1-4939-8728-3_8) method originally developed by Michael Hall and Robert Beiko, and was updated and modified by Diana Haider for the 2021 Microbiome workshop.
This lab is based on the [16S rRNA gene analysis](https://link.springer.com/protocol/10.1007%2F978-1-4939-8728-3_8) method originally developed by Michael Hall and Robert Beiko, and was updated and modified by Diana Haider for the 2022 Microbiome workshop.

## Introduction

The lab in this module is a walkthrough of an end-to-end pipeline using the command line interface for the analysis of high-throughput marker gene data including diversity, taxonomic and statistical explorations. The pipeline described is embedded in the latest version of QIIME2 (version 2021.4, and stands for Quantitative Insights into Microbial Ecology) which is a popular microbiome bioinformatics platform for microbial ecology. It uses software packages called plugins that can be developed by anyone and generates zipped artifacts that can be tracked through a provenance file automatically generated by QIIME2. Two scripts, accessible [here]( https://github.com/beiko-lab/CBW2021_Module2_16S_Analysis), make the whole thing work: “[download_Willis_et_al_data.sh](https://github.com/beiko-lab/CBW2021_Module2_16S_Analysis/blob/main/download_Willis_et_al_data.sh)” retrieves the sequences from the European Bioinformatics Institute through their relatively awesome API, and “[qiime_commands.sh](https://github.com/beiko-lab/CBW2021_Module2_16S_Analysis/blob/main/qiime_commands.sh)” to execute a chain of QIIME2 commands and generate a number of .qza and .qzv files. These files can be unzipped to reveal the sweet contents inside in case you want to (hypothetically, of course) feed them to a non-QIIME2 application. For this tutorial, we will go through each step individually after the import section, starting with the [CBW_Willis_reads.qza](http://quoc.ca/static/CBW_Willis_reads.qza) file.
The lab in this module is a walkthrough of an end-to-end pipeline using the command line interface for the analysis of high-throughput marker gene data including diversity, taxonomic and statistical explorations. The pipeline described is embedded in the latest version of QIIME2 (Quantitative Insights into Microbial Ecology version 2022.2) which is a popular microbiome bioinformatics platform for microbial ecology. It uses software packages called plugins that can be developed by anyone and generates zipped artifacts that can be tracked through a provenance file automatically generated by QIIME2. Two scripts, accessible [here]( https://github.com/beiko-lab/CBW2021_Module2_16S_Analysis), make the whole thing work: “[download_Willis_et_al_data.sh](https://github.com/beiko-lab/CBW2021_Module2_16S_Analysis/blob/main/download_Willis_et_al_data.sh)” retrieves the sequences from the European Bioinformatics Institute (EBI) through their relatively awesome API, and “[qiime_commands.sh](https://github.com/beiko-lab/CBW2021_Module2_16S_Analysis/blob/main/qiime_commands.sh)” to execute a chain of QIIME2 commands and generate a number of .qza and .qzv files. These files can be unzipped to reveal the sweet contents inside in case you want to (hypothetically, of course) feed them to a non-QIIME2 application. For this tutorial, we will go through each step individually after the import section, starting with the [CBW_Willis_reads.qza](http://quoc.ca/static/CBW_Willis_reads.qza) file.


Commonly used marker genes for microbiome analysis include the 16S ribosomal RNA (rRNA) for prokaryotes, 18S rRNA for eukaryotes, and the internal transcribed spacer (ITS) for fungi. In this tutorial, we will explore a 16S rRNA dataset from environmental samples collected along three stations from the Halifax Line at four different depths, amplified using two different primers. One interesting question is whether the primer choice under- or overrepresents certain taxa groups, and then impacts diversity measurements. Details on the data collection and processing can be found in [the original paper](https://academic.oup.com/femsle/article/366/13/fnz152/5538761) that published the data.
Expand All @@ -24,15 +24,16 @@ After this lab you will be able to:
* Know of many tools available to conduct statistical and diversity analyses

## Getting started
QIIME2 is recommended to be run in an independent conda environment to ensure consistency between different versions of the packages it uses. [Here](https://docs.qiime2.org/2021.4/install/) are the instructions for installation. For this instance, a qiime2 environment was already installed, it just needs to be activated.
QIIME2 is recommended to be run in an independent conda environment to ensure consistency between different versions of the packages it uses. [Here](https://docs.qiime2.org/2022.2/install/native/#install-qiime-2-within-a-conda-environment) are the instructions for installation. For this instance, a qiime2 environment was already installed, it just needs to be activated.

```
source activate qiime2-2021.4
source activate qiime2-2022.2
```

You can visualize all available plugins by typing ```qiime```. If it prints out all commands, you're good to go.

<img src="https://github.com/beiko-lab/CBW2021_Module2_16S_Analysis/blob/main/qiime_solutions/images/qiime_terminal.png?raw=true" alt="image 1" width="750" />
<img src="https://github.com/beiko-lab/CBW2021_Module2_16S_Analysis/blob/main/qiime_solutions/images/qiime_terminal.png" width="300">

To start your analysis, two files are required:
* Sequencing data
* Metadata
Expand Down Expand Up @@ -65,38 +66,44 @@ In the interest of time, we will skip the import step, and the denoising step. T

## 16S data processing
### Import raw reads to QIIME2
You can do this step yourself, but for this live lab, we will start with the already imported [CBW_Willis_reads.qza](http://quoc.ca/static/CBW_Willis_reads.qza) file.
If you want to fetch the dataset from EBI yourself, you can run ``` bash download_Willis_et_al_data.sh ``` bash script, but it is time consuming (~30 mins). The script downloads the FASTQ files from the EBI database, and creates a [manifest](https://github.com/beiko-lab/CBW2021_Module2_16S_Analysis/blob/main/sequence_data/MANIFEST.txt) file which is a text file with three columns. It guides the import in QIIME2 by providing the sample name, the file path and the direction of the read if the data has paired reads (either reverse or forward).

Once the manifest file is ready, you can just use the qiime import function:

```
qiime tools import \
--type 'SampleData[PairedEndSequencesWithQuality]' \
--input-path import_to_qiime \
--output-path CBW_Willis_reads
```
**Output**
* CBW_Willis_reads.qza: [download](http://quoc.ca/static/CBW_Willis_reads.qza)

Since we won't run the import step ourselves to save time, we can just download the CBW_Willis_reads.qza file using wget in the sequence_data folder:
You can do this step yourself, but for this live lab, we will start with the already imported [CBW_Willis_reads.qza](http://quoc.ca/static/CBW_Willis_reads.qza) file. It will save us some time, we can just download the CBW_Willis_reads.qza file using wget into the sequence_data folder:

First, we will cd (change directory) to move to our workspace
```
cd workspace/
mkdir Module2
cd Module2
```
Then, we will create a soft link of the qiime artifacts into our workspace so you can use the pre-cooked files for the commands that take a long time to run. A soft link is used to tell your computer where some files are, without having to copy them and take extra space on your hard drive.
```
ln -s ~/CourseData/MIC_data/Module2/qiime_artifacts/
ln -s ~/CourseData/Module2/qiime_artifacts/
```
Now, let's just move to the qiime_artifacts folder.
```
cd qiime_artifacts/
```
From here on, let's just use wget to retrieve the already imported .qza reads from the web.
```
wget http://quoc.ca/static/CBW_Willis_reads.qza
```

If you want to try to fetch the dataset from EBI yourself on your own time, you can run ``` bash download_Willis_et_al_data.sh ``` bash script, but it is time consuming (~30 mins). The bash script downloads the FASTQ files from the EBI database, and creates a [manifest](https://github.com/beiko-lab/CBW2021_Module2_16S_Analysis/blob/main/sequence_data/MANIFEST.txt) file which is a text file with three columns. Manifest files guide the import in QIIME2 by providing the sample name, the file path to the FASTQ, and the direction of each file if the data has paired reads (either reverse or forward).

Once the manifest file is ready after you successfully ran ``` bash download_Willis_et_al_data.sh ```, you can just use the qiime import function:

```
qiime tools import \
--type 'SampleData[PairedEndSequencesWithQuality]' \
--input-path import_to_qiime \
--output-path CBW_Willis_reads
```
**Output**
* CBW_Willis_reads.qza: [download](http://quoc.ca/static/CBW_Willis_reads.qza)


### Quality control

Let's start here. The sequences first undergo quality control, where we decide on a quality trimming threshold based on the quality profiles of the reads. During sequencing, each base called is given a base calling accuracy metric measured with Phred quality scores. The goal is to retain the most bases so the read is long enough to be informative, while removing bases that have low quality scores since they can increase the risk of false-positives. QIIME2 has a visualizer that helps you decide on a threshold.
Let's start here. The sequences first undergo quality control, where we decide on a quality trimming threshold based on the quality profiles of the reads. During sequencing, each base called is given a base calling accuracy metric measured with Phred quality scores. The goal is to retain the most bases so the read is long enough to be informative, while removing bases that have low quality scores since they can increase the risk of false-positives. Careful when you are trimming your reads: it is equally important to keep at least 20nt (by default) overlap between your forward and reverse reads to have a successful joining step. QIIME2 has a visualizer that helps you decide on a threshold based on average quality scores.

This table adapted from [Wikipedia](https://en.wikipedia.org/wiki/Phred_quality_score) can help us interpret quality scores. Generally, a score >20 is acceptable, and >30 is really good.

Expand All @@ -120,13 +127,13 @@ qiime demux summarize \
**Output**
* qual_viz.qzv: [download](https://github.com/beiko-lab/CBW2021_Module2_16S_Analysis/raw/main/qiime_solutions/qual_viz.qzv) / [visualize](https://view.qiime2.org/visualization/?type=html&src=https://quoc.ca/data-artifact?result_id=181)

QIIME has an interface that can view .qza or .qzv files. One motivation to view a qza files is to inspect it's provenance (telling you from which file it comes from, with which parameters it was generated). Qzv are visualizations that we will generate throughout the tutorial. To see them, you have to download the file to your local machine.
QIIME has an interface that can view .qza or .qzv files. Among others, one motivation to view a qza file is to inspect it's provenance. Provenance files keep track of which file the qzv comes from, and with which parameters it was generated. Qzvs are visualizations that we will generate throughout the tutorial. To see them, you have to download the file to your local machine.
1. Type your public ip in a browser
2. Download the .qzv we want to see into a repository in your computer
3. Open [QIIME's Viewer](https://view.qiime2.org)
4. Upload the .qzv to QIIME's Viewer

Another option when the files are in your local machine is to use qiime's view command which will open a browser. Careful, to run this, you need to be in the correct environment (QIIME2's) and correct repository with your files of interest.
Another option when the files are in your local machine is to use qiime's view command from your terminal which will open up the visualization in a browser. Careful, to run this, you need to be in the correct environment (QIIME2's) and correct repository with your files of interest.

```
qiime tools view \
Expand All @@ -145,6 +152,8 @@ This function randomly samples 10 000 of your reads without replacement, and pre

From the quality score boxplot, we decided to trim our sequences at 240 for both forward and reverse reads. I also ran the table with a different trim length, and will leave the file in the [qiime_solutions/f290r240/](https://github.com/beiko-lab/CBW2021_Module2_16S_Analysis/tree/main/qiime_solutions/f290r240) if you want to compare it on your own time. While the original authors clustered their sequences into OTUs, we will make ASVs using DADA2. DADA2 clusters sequences based on quality score, and discards reads that have a high probability to be an error.

Denoising can take ~30mins, so we will use the outputs below that we cooked in advance! The outputs are the foundation for the rest of this analysis.

```
qiime dada2 denoise-paired \
--i-demultiplexed-seqs CBW_Willis_reads.qza \
Expand All @@ -161,10 +170,8 @@ qiime dada2 denoise-paired \
* representative_sequences.qza : [download](https://github.com/beiko-lab/CBW2021_Module2_16S_Analysis/raw/main/qiime_artifacts/representative_sequences.qza)
* denoise_stats.qza : [download](https://github.com/beiko-lab/CBW2021_Module2_16S_Analysis/raw/main/qiime_artifacts/denoise_stats.qza)

<img src="https://github.com/beiko-lab/CBW2021_Module2_16S_Analysis/blob/main/qiime_solutions/images/DADA2_output.png?raw=true" alt="image 1" width="750" />

<img src="https://github.com/beiko-lab/CBW2021_Module2_16S_Analysis/blob/main/qiime_solutions/images/DADA2_output.png" width="600">

Denoising can take ~30mins, so we will use the outputs above that we cooked in advance! The outputs are the foundation for the rest of this analysis.

## From the [feature table](https://github.com/beiko-lab/CBW2021_Module2_16S_Analysis/raw/main/qiime_artifacts/unfiltered_table.qza)

Expand All @@ -189,7 +196,7 @@ It's hard to make ASVs interesting, so let's add their taxonomic identification.

First, we need to download the classifier from QIIME2's database.
```
wget https://data.qiime2.org/2021.4/common/gg-13-8-99-nb-classifier.qza
wget https://data.qiime2.org/2022.2/common/gg-13-8-99-nb-classifier.qza
```
Then, we can classify our data.

Expand Down Expand Up @@ -261,7 +268,7 @@ qiime phylogeny midpoint-root \

Here's the rooted tree visualized on iTol:

<img src="https://github.com/beiko-lab/CBW2021_Module2_16S_Analysis/blob/main/qiime_solutions/images/rooted-tree.png?raw=true" alt="image 1" width="750" />
<img src="https://github.com/beiko-lab/CBW2021_Module2_16S_Analysis/blob/main/qiime_solutions/images/rooted-tree.png" width="300">

### Diversity analyses (Barplots and heatmaps)

Expand Down Expand Up @@ -378,7 +385,7 @@ qiime composition ancom \
**Visualization outputs**
* ancom-size.qzv : [download](https://github.com/beiko-lab/CBW2021_Module2_16S_Analysis/raw/main/qiime_solutions/ancom-size.qzv) / [visualize](https://view.qiime2.org/visualization/?type=html&src=https://quoc.ca/data-artifact?result_id=174)

**Question** : Which taxa are significantly different across different size fraction samples? Can you find their taxonomic ID?
**Question** : Which taxa are significantly different across different 16S variable region samples? Can you find their taxonomic ID?

Let's return the same previous command, but change the metadata-column to Depth this time.

Expand Down