This is a tool to generate a Gambit database. Its primary input is a spreadsheet from GTDB such as this release.
This software can be installed using pip:
pip install .
pip install git+
Additional dependencies are:
- ncbi-genome-download
- GAMBITtools
You can use Docker to run the software. To build the image from scratch, run:
docker build -t gambitdb .
Then, to run one of the scripts:
docker run -v $(pwd):/data gambitdb gambitdb -h
A single shell script creates a database from a GTDB spreadsheet; there is one version for bacteria and one for archaea. The script downloads the data, creates a GAMBIT database and signatures files, then checks the recall against the downloaded files for QC purposes. It is the easiest way to create a database.
To create a database in one step, execute the
./ /path/to/working_directory /path/to/gtdb_spreadsheet.tsv num_cores
This script will create a GAMBIT database from a GTDB spreadsheet. It parses the spreadsheet, downloads data with ncbi-genome-download and outputs a GAMBIT database.
This script parses a GTDB spreadsheet and outputs a list of accessions to download, a species taxon file, and a genome metadata file. It is the first step in creating a database. Options let you set a maximum number of contigs, include samples derived from metagenomes, environmental samples, or single cells, include novel species, set a minimum number of genomes per species, and specify output filenames for the taxonomy, the genome metadata, and the genome accessions to download.
To run this script, use the following command:
gambitdb-gtdb /path/to/gtdb_spreadsheet.tsv
The parameters for the script are:
usage: gambitdb-gtdb [options]
Given a GTDB metadata spreadsheet, output a list of accessions to download, a species taxonid file and a genome metadata file
positional arguments:
GTDB metadata file such as bac120_metadata_r214.tsv
-h, --help show this help message and exit
Minimum checkm completeness of the genome [0-100] (default: 97.0)
Maximum checkm contamination of the genome [0-100] (default: 2.0)
--max_contigs MAX_CONTIGS, -d MAX_CONTIGS
Maximum number of contigs. Please note some species systematically assemble poorly with short read data. (default: 100)
--include_derived_samples, -e
Include mixed samples from metagenomes, environment, single cell (default: False)
--include_novel_species, -f
Include novel species called sp12345. The genus must be known (default: False)
Minimum number of genomes in a species, otherwise exclude the species (default: 2)
Only include species that match this string (default: )
Output filename for with the taxonomy (default: species_taxa.csv)
Genome metadata (default: assembly_metadata.csv)
Genome accessions for download (default: accessions_to_download.csv)
--debug Turn on debugging (default: False)
--verbose, -v Turn on verbose output (default: False)
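The quality filters listed above can be pictured with a short Python sketch. It is only an illustration and not the gambitdb-gtdb implementation: the column names (accession, checkm_completeness, checkm_contamination, contig_count, gtdb_taxonomy) are assumed to match the GTDB metadata file, the sample rows are invented, and filter_genomes is a hypothetical helper.

```python
import csv
import io
from collections import defaultdict

# Invented rows standing in for a GTDB metadata TSV (column names are assumptions).
SAMPLE_TSV = (
    "accession\tcheckm_completeness\tcheckm_contamination\tcontig_count\tgtdb_taxonomy\n"
    "RS_GCF_000000001.1\t99.5\t0.5\t20\td__Bacteria;g__Examplea;s__Examplea alpha\n"
    "RS_GCF_000000002.1\t98.0\t1.0\t80\td__Bacteria;g__Examplea;s__Examplea alpha\n"
    "RS_GCF_000000003.1\t90.0\t0.1\t10\td__Bacteria;g__Examplea;s__Examplea alpha\n"
    "RS_GCF_000000004.1\t99.9\t0.2\t15\td__Bacteria;g__Examplea;s__Examplea beta\n"
)


def filter_genomes(handle, min_completeness=97.0, max_contamination=2.0,
                   max_contigs=100, min_genomes=2):
    """Group passing accessions by species, dropping species with too few genomes."""
    by_species = defaultdict(list)
    for row in csv.DictReader(handle, delimiter="\t"):
        if float(row["checkm_completeness"]) < min_completeness:
            continue  # too incomplete
        if float(row["checkm_contamination"]) > max_contamination:
            continue  # too contaminated
        if int(row["contig_count"]) > max_contigs:
            continue  # too fragmented
        # The species is the last rank of the taxonomy string, e.g. "s__Examplea alpha".
        species = row["gtdb_taxonomy"].split(";")[-1].split("s__", 1)[-1]
        by_species[species].append(row["accession"])
    return {s: accs for s, accs in by_species.items() if len(accs) >= min_genomes}


selected = filter_genomes(io.StringIO(SAMPLE_TSV))
print(selected)
```

In this invented data, one genome fails the completeness threshold and the second species is dropped because only one of its genomes remains, which is below the default minimum of 2.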
The gambitdb script is used to generate a GAMBIT database. It requires a directory containing assemblies in FASTA format and a CSV file containing the assembly file and path, and the species taxon ID. Optionally, it can also take a CSV containing species taxonomy. The script provides several options for customization, including the ability to specify species and accessions to remove, output directory, output filenames, k-mer length, k-mer prefix, minimum number of genomes for a species to be included, number of CPUs to use, and parameters for including a species in a small cluster.
To run this script, use the following command:
gambitdb [options] <assembly_directory> <genome_assembly_metadata> <species_taxon_filename>
The parameters for the script are:
usage: gambitdb [options]
Generate a Gambit database
positional arguments:
assembly_directory A directory containing assemblies in FASTA format
A CSV containing the assembly file and path, and the species taxon ID
CSV containing species taxonomy, may be generated automatically from assembly metadata if missing
-h, --help show this help message and exit
Optional file containing a list of species to remove (1 per line) (default: None)
Optional file containing a list of accession numbers to remove (1 per line) (default: None)
Output directory (default: output_dir)
Output filename for genome signatures (default:
Output filename for core database (default: database.gdb)
Output filename for a list of accessions removed (default: accessions_removed.csv)
Output filename for a list of species removed (default: species_removed.csv)
Output filename for a list of species taxon IDs (default: species_taxon.csv)
Output filename for a list of genome assembly metadata (default: genome_assembly_metadata.csv)
--kmer KMER, -k KMER Length of the k-mer to use (default: 11)
--kmer_prefix KMER_PREFIX, -f KMER_PREFIX
Kmer prefix (default: ATGAC)
Minimum number of genomes for a species to be included (default: 1)
--cpus CPUS, -p CPUS Number of cpus to use (default: 1)
--small_cluster_ngenomes SMALL_CLUSTER_NGENOMES
Minimum number of genomes for a species to be included in a small cluster, along with --small_cluster_diameter (default: 4)
--small_cluster_diameter SMALL_CLUSTER_DIAMETER
Maximum diameter of a species to be included in a small cluster along with --small_cluster_ngenomes (default: 0.7)
--maximum_diameter MAXIMUM_DIAMETER
The maximum diameter to allow before attempting to split a species into subspecies (default: 0.7)
--minimum_cluster_size MINIMUM_CLUSTER_SIZE
After splitting a species into subspecies, this is the minimum number of genomes which must be present, otherwise the genome is removed. (default: 2)
--debug Turn on debugging (default: False)
--verbose, -v Turn on verbose output (default: False)
The gambitdb-create script is used to generate a GAMBIT database. It requires preprocessed input files: a CSV containing the assembly file and path, and the species taxon ID; a CSV containing species taxonomy; and a signatures .h5 file created by gambit signatures.
To run this script, use the following command:
gambitdb-create [options] genome_assembly_metadata species_taxon_filename signatures_filename
The parameters for the script are:
usage: gambitdb-create [options]
Generate a GAMBIT database. Requires preprocessed input files
positional arguments:
A CSV containing the assembly file and path, and the species taxon ID
A CSV containing species taxonomy
signatures_filename A signatures .h5 file created by gambit signatures
-h, --help show this help message and exit
--db_key DB_KEY Unique key for database, no spaces (default: organisation/database)
--db_version DB_VERSION
Unique version, x.y.z (default: 1.0.0)
--db_author DB_AUTHOR
Name of person who created the database (default: Jane Doe)
--db_date DB_DATE Date database was created as YYYY-MM-DD (default: 2022-12-31)
Output filename for genome signatures (default:
Output filename for core database (default: database.gdb)
--verbose, -v Turn on verbose output (default: False)
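The database metadata options above imply some simple format rules: a key without spaces, a version of the form x.y.z, and a date as YYYY-MM-DD. A minimal sketch of checking values before passing them to the script might look like the following. The validate_db_metadata helper is hypothetical, and whether gambitdb-create performs any such validation itself is not stated here.

```python
import re
from datetime import datetime


def validate_db_metadata(db_key, db_version, db_date):
    """Return a list of problems; an empty list means the values look well formed."""
    problems = []
    if not db_key or any(ch.isspace() for ch in db_key):
        problems.append("db_key must be non-empty and contain no spaces")
    if not re.fullmatch(r"\d+\.\d+\.\d+", db_version):
        problems.append("db_version must look like x.y.z")
    try:
        datetime.strptime(db_date, "%Y-%m-%d")
    except ValueError:
        problems.append("db_date must be YYYY-MM-DD")
    return problems


# The documented defaults pass; the second call fails all three checks.
print(validate_db_metadata("organisation/database", "1.0.0", "2022-12-31"))
print(validate_db_metadata("my database", "1.0", "31/12/2022"))
```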
The gambitdb-apply-patch script is used to apply a patch to an existing GAMBIT database. It requires four inputs: the main signatures .h5 file (created by gambit signatures) and the main SQLite database, followed by the patch signatures .h5 file and the patch SQLite database. Options let you specify output filenames and turn on verbose output.
To run this script, use the following command:
gambitdb-apply-patch [options] <signatures_main_filename> <database_main_filename> <signatures_patch_filename> <database_patch_filename>
The parameters for the script are:
usage: gambitdb-apply-patch [options]
Given two GAMBIT signatures files, merge them and return a new file.
positional arguments:
A signatures .h5 file created by gambit signatures
An SQLite database
A signatures .h5 file created by gambit signatures
An SQLite database
-h, --help show this help message and exit
Output filename for genome signatures (default:
Output filename for database (default: patched_database.gdb)
--signatures_main_removed_filename SIGNATURES_MAIN_REMOVED_FILENAME
Output filename for genome signatures with patched genomes removed (default:
--database_main_removed_filename DATABASE_MAIN_REMOVED_FILENAME
Output filename for database with patched genomes removed (default: main_database_removed.gdb)
--verbose, -v Turn on verbose output (default: False)
The gambitdb-compress script is used to compress a GAMBIT database by filtering out genomes based on specified criteria. It requires a signatures .h5 file created by gambit signatures and an SQLite database. Options let you specify the output filenames, the minimum and maximum number of genomes in a species to consider, the proportion of genomes a k-mer must be in for a species to be considered core, the number of CPUs to use, the number of genomes to keep for a species, and whether to keep species under the minimum rather than removing them.
To run this script, use the following command:
gambitdb-compress [options] <signatures_filename> <database_filename>
The parameters for the script are:
usage: gambitdb-compress [options]
Compresses a GAMBIT database by filtering out genomes based on specified criteria.
positional arguments:
signatures_filename A signatures .h5 file created by gambit signatures
database_filename An sqlite database file created by gambit
-h, --help show this help message and exit
Minimum number of genomes in a species to consider, ignore the species below this (default: 10)
Max number of genomes in a species to consider, ignore all others above this (default: 100)
Proportion of genomes a kmer must be in for a species to be considered core (default: 1)
--cpus CPUS, -p CPUS Number of cpus to use (default: 1)
Number of genomes to keep for a species (0 means keep all) (default: 1)
Keep species under minima rather than removing them (default: False)
Output filename for genome signatures (default:
Output filename for genome database (default: filtered_database.gdb)
--verbose, -v Turn on verbose output (default: False)
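One way to read the core proportion option above is that a k-mer counts as core for a species when it occurs in at least that proportion of the species' genomes. The sketch below illustrates that reading only. It is an assumption rather than the gambitdb-compress implementation, core_kmers is a hypothetical helper, and how the tool then chooses which genomes to keep is not described here.

```python
from collections import Counter


def core_kmers(genome_kmers, core_proportion=1.0):
    """Return k-mers present in at least core_proportion of the genomes."""
    if not genome_kmers:
        return set()
    # Count each k-mer once per genome it appears in.
    counts = Counter(kmer for kmers in genome_kmers for kmer in set(kmers))
    threshold = core_proportion * len(genome_kmers)
    return {kmer for kmer, n in counts.items() if n >= threshold}


# Toy k-mer sets for three genomes of one species (invented data).
genomes = [{"AAA", "CCC", "GGG"}, {"AAA", "CCC"}, {"AAA", "TTT"}]
print(sorted(core_kmers(genomes)))       # k-mers shared by every genome
print(sorted(core_kmers(genomes, 0.6)))  # k-mers found in at least 60% of genomes
```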
The gambitdb-remove-genome-signatures script is used to remove a list of genomes from a GAMBIT signatures file and write a new file. It requires a signatures .h5 file created by gambit signatures and a file listing the genomes to remove, one accession per line. Options let you specify the output filename and turn on verbose output.
To run this script, use the following command:
gambitdb-remove-genome-signatures [options] <signatures_filename> <genomes_to_remove_filename>
The parameters for the script are:
usage: gambitdb-remove-genome-signatures [options]
Given a Gambit signatures file, remove a list of genomes from it and return a new file.
positional arguments:
signatures_filename A signatures .h5 file created by gambit signatures
One accession per line in a file
-h, --help show this help message and exit
Output filename for genome signatures (default:
--verbose, -v Turn on verbose output (default: False)
The gambitdb-repair-db script is used to repair a GAMBIT database. It requires an SQLite database. The script provides several options for customization, including the ability to specify output filenames and turn on verbose output.
To run this script, use the following command:
gambitdb-repair-db [options] <database_main_filename>
The parameters for the script are:
usage: gambitdb-repair-db [options]
Given a GAMBIT database, repair it
positional arguments:
An SQLite database
-h, --help show this help message and exit
Output filename for database (default: fixed_database.gdb)
--verbose, -v Turn on verbose output (default: False)
The gambitdb-rebuild-signatures script is used to rebuild a GAMBIT database. It requires a signatures .h5 file created by gambit signatures. The script provides several options for customization, including the ability to specify output filenames, database key, database version, database author, database date, and turn on verbose output.
To run this script, use the following command:
gambitdb-rebuild-signatures [options] <signatures_filename>
The parameters for the script are:
usage: gambitdb-rebuild-signatures [options]
Given a signatures file, rebuild the database.
positional arguments:
signatures_filename A signatures .h5 file created by gambit signatures
-h, --help show this help message and exit
--db_key DB_KEY Unique key for database, no spaces (default: organisation/database)
--db_version DB_VERSION
Unique version, x.y.z (default: 1.0.0)
--db_author DB_AUTHOR
Name of person who created the database (default: Jane Doe)
--db_date DB_DATE Date database was created as YYYY-MM-DD (default: 2022-12-31)
Output filename for genome signatures (default:
Output filename for core database (default: database.gdb)
--verbose, -v Turn on verbose output (default: False)
The gambitdb-iterative-build script is used to iteratively build a GAMBIT database. It requires a signatures .h5 file created by gambit signatures, an SQLite database, and a GTDB spreadsheet. Options let you specify the output filenames, the minimum number of genomes a species must have, a list of species to ignore, the taxonomic rank to use, and the number of CPUs to use.
To run this script, use the following command:
gambitdb-iterative-build [options] <signatures_main_filename> <database_main_filename> <gtdb_metadata_spreadsheet>
The parameters for the script are:
usage: gambitdb-iterative-build [options]
Iteratively add species to a database
positional arguments:
A signatures .h5 file created by gambit signatures
An SQLite database
GTDB database
-h, --help show this help message and exit
--species_added SPECIES_ADDED
file containing a list of species added (default: species_added)
--min_genomes MIN_GENOMES
minimum genomes a species must have (default: 2)
--species_to_ignore SPECIES_TO_IGNORE
file containing a list of species to ignore (default: None)
--rank RANK, -r RANK taxonomic rank (genus/species) (default: species)
--cpus CPUS, -p CPUS Number of cpus to use (default: 1)
These scripts expose functionality deep within the GAMBITdb software. They are mostly for advanced usage, debugging, or restarting a failed database run partway through.
The gambitdb-curate script is used to curate a GAMBIT database. It requires a species taxon file, a genome metadata file, a directory containing assemblies in FASTA format, and a pairwise distance file between each assembly. Options let you specify species and accessions to remove, output filenames, the minimum number of genomes for a species to be included, the number of CPUs to use, the number of genomes and the maximum diameter for a species to be included in a small cluster, the maximum diameter to allow before attempting to split a species into subspecies, and the minimum number of genomes which must be present after splitting a species into subspecies.
To run this script, use the following command:
gambitdb-curate [options] <species_taxon_filename> <genome_assembly_metadata> <assembly_directory> <pairwise_distances_filename>
The parameters for the script are:
usage: gambitdb-curate [options]
Given a species taxon file, and a genome file with metadata, curate the data and produce new files
positional arguments:
CSV containing species taxonomy and diameters, ngenomes - output from gambitdb-diameters
A CSV containing the assembly file and path, and the species taxon ID - output from gambitdb-gtdb
assembly_directory A directory containing assemblies in FASTA format
A pairwise distance file between each assembly
-h, --help show this help message and exit
Optional file containing a list of species to remove (1 per line) (default: None)
Optional file containing a list of accession numbers to remove (1 per line) (default: None)
Output filename for genome signatures (default: species_taxon_curated.csv)
Output filename for core database (default: genome_assembly_metadata_curated.csv)
Output filename for a list of accessions removed (default: accessions_removed.csv)
Output filename for a list of species removed (default: species_removed.csv)
Minimum number of genomes for a species to be included (default: 2)
--small_cluster_ngenomes SMALL_CLUSTER_NGENOMES
Minimum number of genomes for a species to be included in a small cluster, along with --small_cluster_diameter (default: 4)
--small_cluster_diameter SMALL_CLUSTER_DIAMETER
Maximum diameter of a species to be included in a small cluster along with --small_cluster_ngenomes (default: 0.7)
--maximum_diameter MAXIMUM_DIAMETER
The maximum diameter to allow before attempting to split a species into subspecies (default: 0.7)
--minimum_cluster_size MINIMUM_CLUSTER_SIZE
After splitting a species into subspecies, this is the minimum number of genomes which must be present, otherwise the genome is removed. (default: 2)
--debug Turn on debugging (default: False)
--verbose, -v Turn on verbose output (default: False)
The gambitdb-diameters script is used to calculate the diameters of species in a GAMBIT database. It requires a CSV containing the assembly file and path and the species taxon ID, a pairwise distance file between each assembly, and a CSV containing species taxon IDs. Options let you specify output filenames and turn on verbose output.
To run this script, use the following command:
gambitdb-diameters [options] <genome_assembly_metadata> <pairwise_distances_filename> <species_taxon_filename>
The parameters for the script are:
usage: gambitdb-diameters [options]
Given files containing assembly metadata, pairwise distances and species taxon information output a new species file with diameters, and a min-inter file
positional arguments:
A CSV containing the assembly file and path, and the species taxon ID
A pairwise distance file between each assembly
A CSV containing species taxon IDs
-h, --help show this help message and exit
Output filename for the modified species taxon IDs plus diameters (default: species_data_diameters.csv)
Output filename for min inter values (default: min_inter.csv)
--debug Turn on debugging (default: False)
--verbose, -v Turn on verbose output (default: False)
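To make the two outputs above concrete, here is a minimal Python sketch. It assumes that a species diameter is the largest pairwise distance between two genomes of the same species and that the min-inter value is the smallest distance from a species to any genome of a different species. Those definitions, the function names, and the toy distances are assumptions for illustration, not taken from the gambitdb-diameters source.

```python
def species_diameters(distances, species_of):
    """Largest distance between any two genomes assigned to the same species."""
    diameters = {s: 0.0 for s in set(species_of.values())}
    for (a, b), d in distances.items():
        if species_of[a] == species_of[b]:
            s = species_of[a]
            diameters[s] = max(diameters[s], d)
    return diameters


def min_inter(distances, species_of):
    """Smallest distance from each species to a genome of any other species."""
    result = {}
    for (a, b), d in distances.items():
        sa, sb = species_of[a], species_of[b]
        if sa != sb:
            result[sa] = min(result.get(sa, float("inf")), d)
            result[sb] = min(result.get(sb, float("inf")), d)
    return result


# Invented genomes, species labels, and pairwise distances.
species_of = {"g1": "A", "g2": "A", "g3": "B", "g4": "B"}
distances = {
    ("g1", "g2"): 0.2, ("g1", "g3"): 0.8, ("g1", "g4"): 0.9,
    ("g2", "g3"): 0.75, ("g2", "g4"): 0.85, ("g3", "g4"): 0.3,
}
print(species_diameters(distances, species_of))
print(min_inter(distances, species_of))
```

Under these assumptions, a species is well separated when its diameter is smaller than its min-inter value, as is the case for both toy species here.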
The gambitdb-gtdb-testset script is used to check a GAMBIT database built using GTDB against genomes which were not used in the training set. It requires a species taxon file, a genome metadata file, and a GTDB spreadsheet. The script provides several options for customization, including the ability to specify output filenames, the maximum number of genomes per species, and turn on verbose output.
To run this script, use the following command:
gambitdb-gtdb-testset [options] <species_taxon_file> <assembly_metadata_file> <gtdb_metadata_file>
The parameters for the script are:
usage: gambitdb-gtdb-testset [options]
Check a gambit database built using GTDB against genomes which were not used in the training set. Produces a list of genomes to use and their predicted species (GTDB)
positional arguments:
species_taxon_file Species taxon file produced by gambitdb, e.g. species_taxon.csv
Assembly metadata file produced by gambitdb, e.g. genome_assembly_metadata.csv
gtdb_metadata_file GTDB spreadsheet
-h, --help show this help message and exit
Output assemblies for download filename. These can be used with ncbi-genome-downloader (default: assemblies_for_download.txt)
Output assembly to species filename (default: species_to_assembly.csv)
Max genomes per species (default: 5)
--debug Turn on debugging (default: False)
--verbose, -v Turn on verbose output (default: False)
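The selection this script describes can be sketched as follows: take genomes from the GTDB spreadsheet that were not used to build the database, keep only species the database knows about, and cap the number per species. This is a simplified illustration with invented accessions, and select_test_genomes is a hypothetical helper rather than the tool's code.

```python
def select_test_genomes(gtdb_rows, training_accessions, db_species, max_per_species=5):
    """Pick up to max_per_species unseen genomes for each species in the database."""
    chosen = {}
    for accession, species in gtdb_rows:
        if accession in training_accessions:
            continue  # already used to build the database
        if species not in db_species:
            continue  # the database cannot predict this species
        picked = chosen.setdefault(species, [])
        if len(picked) < max_per_species:
            picked.append(accession)
    return chosen


# Invented (accession, GTDB species) pairs.
gtdb_rows = [("GCF_1", "A"), ("GCF_2", "A"), ("GCF_3", "A"), ("GCF_4", "B"), ("GCF_5", "C")]
test_set = select_test_genomes(gtdb_rows, training_accessions={"GCF_1"},
                               db_species={"A", "B"}, max_per_species=1)
print(test_set)
```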
The gambitdb-pairwise-table script is used to generate a table of pairwise distances between assemblies. It requires a directory containing assemblies in FASTA format. The script provides several options for customization, including the ability to specify output filenames, the k-mer length, the k-mer prefix, the number of CPUs to use, and turn on verbose output.
To run this script, use the following command:
gambitdb-pairwise-table [options] <assembly_directory>
The parameters for the script are:
usage: gambitdb-pairwise-table [options]
Given a directory of assemblies in FASTA format, generate a table of pairwise distances
positional arguments:
assembly_directory A directory containing assemblies in FASTA format
-h, --help show this help message and exit
Optional file containing a list of accession numbers to remove/ignore if found in assembly directory (1 per line) (default: None)
Output filename for genome signatures (default: signatures.h5)
Output filename for pairwise distance table (default: pw-dists.csv)
--kmer KMER, -k KMER Length of the k-mer to use (default: 11)
--kmer_prefix KMER_PREFIX, -f KMER_PREFIX
Kmer prefix (default: ATGAC)
--cpus CPUS, -c CPUS Number of cpus to use (default: 1)
--debug Turn on debugging (default: False)
--verbose, -v Turn on verbose output (default: False)
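The k-mer length and prefix options can be illustrated with a small Python sketch. It assumes that a signature is the set of k-mers that immediately follow each occurrence of the prefix, and that the distance between two assemblies is the Jaccard distance between their sets. This is a simplification for illustration: it scans the forward strand only, uses a short k so the toy sequences stay readable, and both function names are hypothetical.

```python
def prefix_kmers(sequence, prefix="ATGAC", k=11):
    """Collect the k bases that follow each occurrence of prefix (forward strand only)."""
    sequence = sequence.upper()
    found = set()
    start = sequence.find(prefix)
    while start != -1:
        begin = start + len(prefix)
        kmer = sequence[begin:begin + k]
        if len(kmer) == k:  # skip occurrences too close to the end of the sequence
            found.add(kmer)
        start = sequence.find(prefix, start + 1)
    return found


def jaccard_distance(a, b):
    """1 minus the shared fraction of two k-mer sets (0.0 when both are empty)."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


# Invented sequences; k=3 keeps the example short (the documented default is 11).
sig1 = prefix_kmers("ATGACAAAGGATGACCCC", k=3)
sig2 = prefix_kmers("ATGACAAATT", k=3)
print(sorted(sig1), sorted(sig2), jaccard_distance(sig1, sig2))
```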
The gambitdb-merge-signatures script is used to merge two GAMBIT signatures files. It requires two signatures .h5 files created by gambit signatures. The script provides several options for customization, including the ability to specify output filenames and turn on verbose output.
To run this script, use the following command:
gambitdb-merge-signatures [options] <signatures_main_filename> <signatures_patch_filename>
The parameters for the script are:
usage: gambitdb-merge-signatures [options]
Given two Gambit signatures files, merge them and return a new file.
positional arguments:
A signatures .h5 file created by gambit signatures
A patch signatures .h5 file created by gambit signatures
-h, --help show this help message and exit
Output filename for genome signatures (default:
--verbose, -v Turn on verbose output (default: False)
Contributions to this project are welcome. To contribute, please fork the repository and submit a pull request.
This project is licensed under the GNU GPL 3 License - see the LICENSE file for details.