GEMSTONE_Isolates_Illumina_PE
v1.0.0
This workflow processes paired-end Illumina reads from bacterial isolates. It performs QA and QC, flagging low-quality and contaminated samples; samples whose reads fail QC are not analyzed further. It performs de novo assembly with SPAdes and refines the assembly with Pilon. It then finds AMR, virulence, and stress genes with AMRFinderPlus, identifies and types plasmid contigs with MOB-recon, types sequences with TS-MLST, annotates the assembly with Bakta, and infers taxonomy with GAMBIT. It also performs taxa-specific analyses; please refer to the TheiaProk documentation for those tasks. Optionally, it estimates taxa abundances with Kraken2 and Bracken and performs strain-level identification with StrainGE (based on the GAMBIT-predicted taxon).
This workflow was based on the PHB v1.0.0 TheiaProk workflow from Theiagen.
Boolean
Optional
Default = false
If `true`, predicts species based on k-mer similarity (with k = 16) between the assembly and genomes in a Theiagen bacterial database.
Boolean
Optional
Default = false
If `true`, identifies taxa and estimates their abundance with Kraken2/Bracken.
Boolean
Optional
Default = false
If `true`, performs strain-level detection with StrainGE.
String
Optional
If set, overrides the GAMBIT-predicted species when setting a species in AMRFinderPlus and when comparing QC metrics against the thresholds defined in `qc_check_table`. Useful when GAMBIT predictions are incorrect. If not set, the workflow uses the GAMBIT prediction.
File
Required
FASTQ file with forward raw reads. Must be Illumina paired-end.
File
Required
FASTQ file with reverse raw reads. Must be Illumina paired-end.
String
Required
Name or ID of the sample.
String
Optional
Genus or species name, as determined in the lab. Must be written in full, with whitespaces (e.g., Escherichia coli, not E. coli or Escherichia_coli). It is compared against the GAMBIT-predicted taxonomy to derive a taxonomy QC flag, corresponding to the `qc_taxonomy_check` output. If the lab-predicted genus matches the GAMBIT prediction, then `qc_taxonomy_check` is set to `PASS`; otherwise, it is set to `ALERT`.
Float
Optional
Default = 2
Maximum contamination as a percentage (as determined by checkM2) allowed for a sample to pass taxonomy QC. The default of 2 means that the contamination threshold is 2%.
File
Optional
User-defined, taxa-specific thresholds for QC metrics as a TSV file. If all QC metrics meet their thresholds, the `qc_check` output variable will read `QC_PASS`. Otherwise, the output will read `QC_NA` if the task could not proceed, or `QC_ALERT` followed by a string indicating which metric failed. Each row in the table should be a species or genus, written in full, with underscores instead of whitespaces (matching the format from GAMBIT). The first column should be named taxon, and each remaining column should be named after a QC metric. The sample taxon is taken from the `gambit_predicted_taxon` value inferred by GAMBIT or from a user-defined `expected_taxon`. Example of a `qc_check_table`:
taxon | est_coverage_raw | est_coverage_clean | assembly_length_min | assembly_length_max |
---|---|---|---|---|
Listeria_monocytogenes | 20 | | 2800000 | 3200000 |
Escherichia_coli | 40 | | 4900000 | 6000000 |
Shigella | 40 | | 4200000 | 4900000 |
Salmonella | 30 | | 4400000 | 5700000 |
Campylobacter | 20 | | 1400000 | 2200000 |
Vibrio_cholerae | 40 | | 3800000 | 4300000 |
Vibrio_parahaemolyticus | 40 | | 4900000 | 5500000 |
Vibrio_vulnificus | 40 | | 4700000 | 5300000 |
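As an illustration of how such a table could be applied to one sample, here is a minimal Python sketch. It is not the workflow's implementation. The function names, the rule that `_min`/`_max` columns are lower/upper bounds while other columns are minimums, the fallback from species to genus, and the exact wording of the alert string are all assumptions made for this example.

```python
import csv
import io

# Rows adapted from the example table, keeping only the columns used below.
QC_TABLE_TSV = (
    "taxon\test_coverage_raw\tassembly_length_min\tassembly_length_max\n"
    "Listeria_monocytogenes\t20\t2800000\t3200000\n"
    "Escherichia_coli\t40\t4900000\t6000000\n"
    "Salmonella\t30\t4400000\t5700000\n"
)


def load_thresholds(tsv_text):
    """Parse the TSV into {taxon: {column: threshold}}, skipping empty cells."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    table = {}
    for row in reader:
        taxon = row.pop("taxon")
        table[taxon] = {k: float(v) for k, v in row.items() if v not in (None, "")}
    return table


def check_sample(taxon, metrics, thresholds):
    """Compare sample metrics with the thresholds of its species or genus."""
    # Assumption: try the full taxon first, then its genus (text before "_").
    limits = thresholds.get(taxon) or thresholds.get(taxon.split("_")[0])
    if limits is None:
        return "QC_NA"  # assumption: no matching row means the check cannot run
    failed = []
    for column, limit in limits.items():
        if column.endswith("_max"):
            metric, passed = column[:-4], metrics[column[:-4]] <= limit
        elif column.endswith("_min"):
            metric, passed = column[:-4], metrics[column[:-4]] >= limit
        else:
            metric, passed = column, metrics[column] >= limit
        if not passed:
            failed.append(f"{metric} did not meet {column}={limit}")
    return "QC_PASS" if not failed else "QC_ALERT: " + "; ".join(failed)


thresholds = load_thresholds(QC_TABLE_TSV)
good = {"est_coverage_raw": 55, "assembly_length": 5100000}
low_cov = {"est_coverage_raw": 10, "assembly_length": 5100000}
print(check_sample("Escherichia_coli", good, thresholds))
print(check_sample("Escherichia_coli", low_cov, thresholds))
print(check_sample("Salmonella_enterica", good, thresholds))
```

The last call shows the genus fallback: `Salmonella_enterica` has no row of its own, so the `Salmonella` thresholds are used.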
Int
Optional
Expected genome size in bp. Used during read and assembly QC. If not provided, the workflow uses the genome length estimated by QUAST for assembly QC.
Boolean
Optional
Default = false
If `true`, skips read QA/QC.
Int
Optional
Default = 7472
Minimum number of reads (raw and clean) needed to pass QC.
Int
Optional
Default = 2241820
Minimum number of total bp in reads (raw and clean) needed to pass QC.
Int
Optional
Default = 7472
Minimum estimated genome size in bp needed to pass QC (both for raw and clean reads).
Int
Optional
Default = 18040666
Maximum estimated genome size in bp allowed to pass QC (both for raw and clean reads).
Int
Optional
Default = 10
Minimum estimated genome coverage needed to pass QC (both for raw and clean reads).
Int
Optional
Default = 40
Minimum percentage of the total base pairs that must be present in each of the forward and reverse read files. A sample fails read screening if either the forward or the reverse read file contains less than this percentage of the base pairs.
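The read screening thresholds above can be pictured with the following minimal Python sketch. It is an illustration under stated assumptions, not the workflow's code: the function and parameter names are hypothetical, the defaults are the documented ones, read and base-pair counts are assumed to be summed over both files, and the failure messages are invented for the example.

```python
def screen_reads(
    n_reads_fwd, n_reads_rev, bp_fwd, bp_rev, est_genome_length,
    min_reads=7472, min_basepairs=2241820,
    min_genome_length=7472, max_genome_length=18040666,
    min_coverage=10, min_proportion=40,
):
    """Return "PASS", or "FAIL" followed by the first reason for failure."""
    total_reads = n_reads_fwd + n_reads_rev
    total_bp = bp_fwd + bp_rev
    if total_reads < min_reads:
        return f"FAIL; number of reads ({total_reads}) is below {min_reads}"
    if total_bp < min_basepairs:
        return f"FAIL; number of base pairs ({total_bp}) is below {min_basepairs}"
    # Percentage of base pairs contributed by each file.
    pct_fwd = 100 * bp_fwd / total_bp
    pct_rev = 100 * bp_rev / total_bp
    if min(pct_fwd, pct_rev) < min_proportion:
        return f"FAIL; proportion of base pairs in one file is below {min_proportion}%"
    if not min_genome_length <= est_genome_length <= max_genome_length:
        return f"FAIL; estimated genome size ({est_genome_length}) is out of range"
    coverage = total_bp / est_genome_length
    if coverage < min_coverage:
        return f"FAIL; estimated coverage ({coverage:.1f}) is below {min_coverage}"
    return "PASS"


print(screen_reads(500_000, 500_000, 75_000_000, 75_000_000, 5_000_000))
print(screen_reads(1_000, 1_000, 150_000, 150_000, 5_000_000))
print(screen_reads(500_000, 500_000, 10_000_000, 90_000_000, 5_000_000))
```

The third call fails because only 10% of the base pairs come from the forward file, below the default of 40.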
Int
Optional
Default = 75
Minimum read length in bp required after trimming for it to be included in downstream analyses.
Int
Optional
Default = 20
Average quality of bases in a sliding window needed for those bases to be kept.
Int
Optional
Default = 10
Length in bp of the window used for trimming.
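To make the three trimming parameters concrete, here is a simplified Python sketch of sliding-window quality trimming. It is only an approximation of what a trimmer does with these settings: the function names are hypothetical, the read is cut at the start of the first window whose mean quality drops below the threshold, and reads shorter than the window are kept whole. The actual tool may handle edge cases differently.

```python
def sliding_window_trim(quals, window=10, min_avg_quality=20):
    """Return how many bases to keep from the 5' end of a read.

    quals is a list of per-base Phred scores. The read is cut at the start
    of the first window whose mean quality is below min_avg_quality.
    """
    for start in range(0, len(quals) - window + 1):
        win = quals[start:start + window]
        if sum(win) / window < min_avg_quality:
            return start
    return len(quals)


def keep_read(quals, min_length=75, window=10, min_avg_quality=20):
    """Apply the trim, then the minimum-length filter."""
    return sliding_window_trim(quals, window, min_avg_quality) >= min_length


# 80 good bases followed by 20 poor ones.
read = [30] * 80 + [5] * 20
print(sliding_window_trim(read))  # bases kept
print(keep_read(read))            # survives the 75 bp length filter?
```

With the defaults, the first window whose mean falls below 20 starts at position 75, so 75 bases are kept and the read just passes the minimum-length filter.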
File
Optional
Compressed Kraken2/Bracken database as a tar archive. Required if `call_kraken` is `true`. Please make sure that the archive contains all the files needed to run Bracken (refer to the Bracken docs).
Int
Optional
Default = 150
Input read length.
String
Optional
Default = "G"
Taxonomic level for Bracken abundance estimation. Defaults to genus level (G). Other options are K (kingdom), P (phylum), C (class), O (order), F (family), and S (species).
Int
Optional
Default = 256
Disk size in Gb for the Kraken2/Bracken task.
Int
Optional
Default = 4
Number of CPUs used in the Kraken2/Bracken task.
Int
Optional
Default = 32
RAM in Gb for the Kraken2/Bracken task.
File
Required
TSV configuration file for StrainGE databases. It should be a table with two columns: one with the database genus name (e.g., Escherichia or Proteus), and another with the path to the tar archive with the StrainGE database for that genus. An example of this table is:
Escherichia | gs://fc-secure-uuid/databases/strainge/escherichia_shigella.tar.gz |
---|---|
Shigella | gs://fc-secure-uuid/databases/strainge/escherichia_shigella.tar.gz |
Pseudomonas | gs://fc-secure-uuid/databases/strainge/pseudomonas.tar.gz |
Staphylococcus | gs://fc-secure-uuid/databases/strainge/staphylococcus.tar.gz |
Proteus | gs://fc-secure-uuid/databases/strainge/proteus.tar.gz |
Klebsiella | gs://fc-secure-uuid/databases/strainge/klebsiella.tar.gz |
Acinetobacter | gs://fc-secure-uuid/databases/strainge/acinetobacter.tar.gz |
Enterobacter | gs://fc-secure-uuid/databases/strainge/enterobacter.tar.gz |
Enterococcus | gs://fc-secure-uuid/databases/strainge/enterococcus.tar.gz |
Int
Optional
Default = 5
Maximum number of strains searched by StrainGST.
Int
Optional
Default = 23
K-mer size used when creating the StrainGST databases.
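The two-column StrainGE configuration file described above can be read with a few lines of Python. This is a minimal sketch, not the workflow's code: it assumes the file has no header row, and the helper names and the rule of matching on the first word of the taxon are assumptions for illustration.

```python
import csv
import io

# Two-column TSV: genus name, then path to the StrainGE database archive.
CONFIG_TSV = (
    "Escherichia\tgs://fc-secure-uuid/databases/strainge/escherichia_shigella.tar.gz\n"
    "Shigella\tgs://fc-secure-uuid/databases/strainge/escherichia_shigella.tar.gz\n"
    "Klebsiella\tgs://fc-secure-uuid/databases/strainge/klebsiella.tar.gz\n"
)


def load_db_config(tsv_text):
    """Return {genus: database_path} from the configuration TSV."""
    rows = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    return {genus: path for genus, path in rows}


def find_db(taxon, config):
    """Look up the database for a taxon by its genus (first word).

    Accepts names written with spaces or underscores. Returns None when no
    database is configured for that genus.
    """
    genus = taxon.replace("_", " ").split()[0]
    return config.get(genus)


config = load_db_config(CONFIG_TSV)
print(find_db("Escherichia coli", config))
print(find_db("Listeria_monocytogenes", config))
```

Listing Escherichia and Shigella on separate rows that point to the same archive, as in the example table, lets either genus resolve to the shared database.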
String
Version of the TheiaProk workflow used.
String
Analysis date.
String
"PASS" or "FAIL" result from raw read screening. If the result is "FAIL", the flag is accompanied by the reason for failure.
String
Optional
"PASS" or "FAIL" result from clean read screening. If the result is "FAIL", the flag is accompanied by the reason for failure. If the raw reads did not pass QC, `clean_read_screen` will not be returned.
String
Optional
"QC_PASS" or "QC_ALERT" flag for taxon identification. If the `lab_predicted_genus` and the taxon prediction from GAMBIT match and the estimated genome contamination is less than `contamination_threshold`, this flag is set to "QC_PASS"; otherwise, it is set to "QC_ALERT". If the `lab_predicted_genus` is unknown, this flag only depends on the estimated contamination.
File
Optional
Text file containing detailed flags for taxon identification.
- "Lab genus QC" holds the result of the comparison between `lab_predicted_genus` and the taxon prediction from GAMBIT: if these match or `lab_predicted_genus` is unknown, this flag is set to "QC_PASS"; otherwise, it is set to "QC_ALERT".
- "Contamination QC" is set to "QC_PASS" if the contamination estimated by checkM2 is less than or equal to `contamination_threshold`, and to "QC_ALERT" otherwise.
- "Global QC" holds the global QC flag for taxon identification: "QC_PASS" if both "Lab genus QC" and "Contamination QC" are "QC_PASS", or "QC_ALERT" otherwise. It has the same value as `qc_taxonomy_check`.
String
Optional
"QC_PASS"/"QC_ALERT" flag resulting from the comparison between QC metrics and the user-defined thresholds in the `qc_check_table`. Returned only if `qc_check_table` is provided.
File
Optional
The `qc_check_table` with the user-defined thresholds. If no `qc_check_table` is provided, `read_and_table_qc_check` is not returned.
String
Optional
Global "QC_PASS"/"QC_ALERT" flag. It is set to "QC_ALERT" if and only if at least one of `read_and_table_qc_check` or `qc_taxonomy_check` is "QC_ALERT". Otherwise, it is set to "QC_PASS".
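The way the taxonomy flags and the global flag combine can be summarized in a short Python sketch. It follows the rules described in this section, but it is not the workflow's code: the function names are hypothetical, genus matching is assumed to compare the first word of each name case-insensitively, and contamination uses the "less than or equal to" rule from the detailed flag description.

```python
def _genus(name):
    """First word of a taxon name written with spaces or underscores."""
    return name.replace("_", " ").split()[0].lower()


def taxonomy_qc(lab_genus, gambit_taxon, contamination, contamination_threshold=2.0):
    """Return the three taxonomy QC flags as a dictionary."""
    if not lab_genus:
        # Unknown lab genus: only contamination decides the outcome.
        genus_flag = "QC_PASS"
    else:
        same = _genus(lab_genus) == _genus(gambit_taxon)
        genus_flag = "QC_PASS" if same else "QC_ALERT"
    cont_ok = contamination <= contamination_threshold
    cont_flag = "QC_PASS" if cont_ok else "QC_ALERT"
    both_pass = genus_flag == "QC_PASS" and cont_flag == "QC_PASS"
    return {
        "Lab genus QC": genus_flag,
        "Contamination QC": cont_flag,
        "Global QC": "QC_PASS" if both_pass else "QC_ALERT",
    }


def global_qc(read_and_table_flag, taxonomy_flag):
    """QC_ALERT if any available flag is an alert, QC_PASS otherwise.

    A flag of None stands for an output that was not returned.
    """
    flags = [f for f in (read_and_table_flag, taxonomy_flag) if f is not None]
    alert = any(f.startswith("QC_ALERT") for f in flags)
    return "QC_ALERT" if alert else "QC_PASS"


flags = taxonomy_qc("Escherichia coli", "Escherichia_coli", contamination=1.2)
print(flags)
print(global_qc(None, flags["Global QC"]))
```

Here the table-based check was not run (no `qc_check_table`), so the global flag depends only on the taxonomy flag.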
String
Optional
Reason for a `qc_check` "QC_ALERT" flag.
Int
Optional
Number of reads in `read1`.
Int
Optional
Number of reads in `read2`.
Int
Optional
Number of pairs of reads in `read1` and `read2` (raw reads).
String
Optional
Version of fastq_scan used.
Int
Optional
Number of reads after QC in `read1_clean`.
Int
Optional
Number of reads after QC in `read2_clean`.
Int
Optional
Number of read pairs after QC in `read1_clean` and `read2_clean` (clean reads).
String
Optional
Version of trimmomatic used.
File
Optional
FASTQ file with forward cleaned reads, after QC and de-hosting.
File
Optional
FASTQ file with reverse cleaned reads, after QC and de-hosting.
String
Optional
Name of the BBDuk Docker image used.
Float
Optional
Mean quality score of forward raw reads.
Float
Optional
Mean quality score of reverse raw reads.
Float
Optional
Mean quality score of forward and reverse raw reads.
Float
Optional
Mean quality score of forward and reverse clean reads.
Float
Optional
Mean read length in bp of forward raw reads.
Float
Optional
Mean read length in bp of reverse raw reads.
Float
Optional
Mean read length in bp of forward and reverse raw reads.
Float
Optional
Mean read length in bp of forward and reverse clean reads.
Float
Optional
Estimated coverage of raw reads, given the estimated genome length.
File
Optional
TSV file with read metrics from clean reads, including average read length, number of reads, and estimated genome coverage.
Float
Optional
Estimated coverage of clean reads, given the estimated genome length.
File
Optional
TSV file with read metrics from raw reads, including average read length, number of reads, and estimated genome coverage.
String
Optional
Name of the Docker image used in the CG pipeline (used to get raw and clean reads QC metrics).
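The estimated coverage values above relate the amount of sequence data to the genome length. As a rough Python sketch of that relationship (the function name is hypothetical, and the exact calculation performed by the CG pipeline may differ):

```python
def estimated_coverage(num_reads, mean_read_length, genome_length):
    """Approximate depth of coverage: total sequenced bases per genome base."""
    total_bp = num_reads * mean_read_length
    return total_bp / genome_length


# One million 150 bp reads over a 5 Mb genome.
print(estimated_coverage(1_000_000, 150, 5_000_000))
```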
File
Optional
Assembly FASTA file. This file is the refined assembly output from Pilon. See the shovill pipeline documentation for more information on how the assembly is generated.
File
Optional
Assembly graph as a GFA file. This is the assembly graph from SPAdes, prior to the Pilon refinement. See the shovill pipeline documentation for more information on how the assembly is generated.
String
Optional
Version of the shovill pipeline used for assembly.
File
Optional
Assembly QC report from QUAST as a text file.
String
Optional
QUAST version used for assembly QC.
Int
Optional
Total contig length in bp.
Int
Optional
Total number of contigs in the assembly.
Int
Optional
Assembly N50 value as computed by QUAST: the length in bp of the shortest contig among the largest contigs that together contain at least 50% of the total assembly length.
Float
Optional
Assembly GC percentage as computed by QUAST.
File
Optional
TSV report from checkM2, including the estimated assembly completeness and contamination.
Float
Optional
Estimated genome completeness, as a percentage (i.e., a value of 100 means 100% completeness), estimated by checkM2.
Float
Optional
Estimated genome contamination, as a percentage (i.e., a value of 1 means 1% contamination), estimated by checkM2.
String
Optional
Version of checkM2 used to estimate genome contamination and completeness.
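For readers unfamiliar with the N50 statistic reported among the assembly outputs, the following Python sketch shows the standard calculation on a list of contig lengths. It is an illustration only, with a hypothetical function name; QUAST computes the reported value and may apply additional filters, such as a minimum contig length.

```python
def n50(contig_lengths):
    """Length of the contig at which the running total, taken from the
    largest contig down, first reaches half of the total assembly length."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty assembly


# Total length is 300; the two largest contigs (100 + 80) cover at least half.
print(n50([100, 80, 60, 40, 20]))
```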
GAMBIT
File
Optional
Report from GAMBIT as a text file, including the predicted species.
File
Optional
CSV file listing genomes in the GAMBIT database that are most similar to the assembly.
String
Optional
Predicted taxon of the assembly as estimated by GAMBIT.
String
Optional
Rank of the predicted taxon of the assembly as estimated by GAMBIT (e.g. species, genus...).
String
Optional
Name of the Docker image used in the GAMBIT task (for taxon identification).
String
Optional
Version of GAMBIT used for taxon identification.
String
Optional
Version of the GAMBIT database used for taxon identification.
k-mer similarity-based taxonomy identification
File
Optional
Results of the k-mer similarity-based taxonomy identification, as a TSV file. Returned only if `call_kmerfinder` is `true`.
String
Optional
Top hit species of the k-mer similarity-based taxonomy identification. Returned only if `call_kmerfinder` is `true`.
String
Optional
Query coverage of the top hit result of the k-mer similarity-based taxonomy identification. Returned only if `call_kmerfinder` is `true`.
String
Optional
Template coverage of the top hit result of the k-mer similarity-based taxonomy identification. Returned only if `call_kmerfinder` is `true`.
String
Optional
Name of the Docker image used for k-mer similarity-based taxonomy identification. Returned only if `call_kmerfinder` is `true`.
String
Optional
Reference database used for k-mer similarity-based taxonomy identification. Returned only if `call_kmerfinder` is `true`.
MLST
File
Optional
TSV report with detailed MLST profile, including missing data symbols.
String
Optional
Predicted sequence type.
String
Optional
PubMLST scheme used to infer sequence type.
File
Optional
Allelic profile detected when inferring sequence type.
String
Optional
FASTA file containing nucleotide sequences of any alleles that are not in the MLST database.
String
Optional
Version of MLST used to infer sequence type.
String
Optional
Name of the Docker image used in the MLST task (for sequence typing).
File
Report of all genes (virulence, stress, and AMR) found by AMRFinderPlus, as a TSV file.
File
Report of AMR genes found by AMRFinderPlus, as a TSV file.
File
Report of stress genes found by AMRFinderPlus, as a TSV file.
File
Report of virulence genes found by AMRFinderPlus, as a TSV file.
String
Comma separated list of core AMR genes found by AMRFinderPlus.
String
Comma separated list of plus AMR genes found by AMRFinderPlus.
String
Comma separated list of stress genes found by AMRFinderPlus.
String
Comma separated list of virulence genes found by AMRFinderPlus.
String
Comma separated list of classes of antimicrobial drugs for which AMR genes were found by AMRFinderPlus.
String
Comma separated list of subclasses of antimicrobial drugs for which AMR genes were found by AMRFinderPlus.
String
Version of AMRFinderPlus used for AMR genotyping.
String
Version of AMRFinderPlus database used for AMR genotyping.
File
MAG gene annotations from Bakta in GenBank format.
File
MAG gene annotations from Bakta in GFF3 format.
File
MAG gene annotations from Bakta in TSV format.
File
Summary report of MAG gene annotation from Bakta.
String
Version of Bakta used for MAG gene annotation.
File
TSV file with plasmid/chromosome classification of contigs from MOB-recon.
File
TSV file with plasmid typing results from MOB-typer.
File
FASTA file of chromosomal contigs in the MAG.
File
FASTA file of plasmid contigs in the MAG.
String
Name of the MOB-recon/MOB-suite Docker image used for plasmid identification.
String
Version of MOB-recon/MOB-suite used for plasmid identification.
Array[File]
Optional
Files with k-merized input reads. The size of the array depends on how many genera are assigned in `lab_determined_genus`, but the contents of each file should be the same. Returned only if `call_strainge` and `straingst_found_db` are `true`.
Array[File]
Optional
StrainGST databases used in each call to StrainGE. The size of the array depends on how many genera are assigned in `lab_determined_genus`. Returned only if `call_strainge` and `straingst_found_db` are `true`.
Boolean
Optional
Whether a StrainGST database matching the genera in `lab_determined_genus` was found. Returned only if `call_strainge` is `true`.
Array[File]
Optional
Text files of strains found by StrainGST when using each database. Returned only if `call_strainge` and `straingst_found_db` are `true`.
Array[File]
Optional
Reports with StrainGST statistics, including strain relative abundances, for each database used. Returned only if `call_strainge` and `straingst_found_db` are `true`.
Array[File]
Optional
Concatenated references as FASTA files needed for downstream StrainGR analysis. The size of the array depends on how many genera are assigned in `lab_determined_genus`. Returned only if `call_strainge` and `straingst_found_db` are `true`.
Array[File]
Optional
Indexed BAM files of clean reads mapped to the concatenated references (in `straingr_concat_fasta`). The size of the array depends on how many genera are assigned in `lab_determined_genus`. Returned only if `call_strainge` and `straingst_found_db` are `true`.
Array[File]
Optional
HDF5 files with variant calling results from StrainGR. The size of the array depends on how many genera are assigned in `lab_determined_genus`. Returned only if `call_strainge` and `straingst_found_db` are `true`.
Array[File]
Optional
StrainGR reports with variant calling statistics. The size of the array depends on how many genera are assigned in `lab_determined_genus`. Returned only if `call_strainge` and `straingst_found_db` are `true`.
Array[String]
Optional
Name of the StrainGE Docker image used for strain-level detection. The size of the array depends on how many genera are assigned in `lab_determined_genus`. Returned only if `call_strainge` is `true`.
Array[String]
Optional
Version of StrainGE used for strain-level detection. The size of the array depends on how many genera are assigned in `lab_determined_genus`. Returned only if `call_strainge` is `true`.
Marco Teixeira
Colin Worby