Skip to content

Commit

Permalink
Merge pull request #338 from nickjcroucher/update_rec
Browse files Browse the repository at this point in the history
v3.2.1
  • Loading branch information
nickjcroucher authored May 24, 2022
2 parents a9d5dcd + 5f80ed7 commit eb7b75e
Show file tree
Hide file tree
Showing 56 changed files with 3,184 additions and 2,464 deletions.
13 changes: 12 additions & 1 deletion CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -1,8 +1,19 @@
# Change Log

## [v3.1.6](https://github.com/sanger-pathogens/gubbins/tree/v3.1.6) (2021-1-20)
## [v3.2.1](https://github.com/sanger-pathogens/gubbins/tree/v3.2.1) (2022-5-24)
[Full Changelog](https://github.com/sanger-pathogens/gubbins/compare/v3.1.6...v3.2.1)

- Fix problem with sequence reconstruction
- Improve detection of small recombinations by modifying window sizes
- Enable resumption of stalled analyses
- Clean C code
- Fixes to scripts
- Add CI tests and update expected results

## [v3.1.6](https://github.com/sanger-pathogens/gubbins/tree/v3.1.6) (2022-1-20)
[Full Changelog](https://github.com/sanger-pathogens/gubbins/compare/v3.1.5...v3.1.6)


- Fix problem with sequence reconstruction
- Add test for consistency of reconstructions

Expand Down
2 changes: 1 addition & 1 deletion README.md
Original file line number Diff line number Diff line change
Expand Up @@ -105,7 +105,7 @@ chmod +x configure
make
sudo make install
cd python
python setup.py install
python3 -m pip install .
```

### OSX/Linux/Windows - Virtual Machine
Expand Down
2 changes: 1 addition & 1 deletion VERSION
Original file line number Diff line number Diff line change
@@ -1 +1 @@
3.1.6
3.2.1
23 changes: 16 additions & 7 deletions docs/gubbins_manual.md
Original file line number Diff line number Diff line change
Expand Up @@ -170,7 +170,7 @@ Gubbins was originally designed to use a [joint ancestral state reconstruction](

### Recombination detection options

Recombination is detected using a [spatial scanning statistic](https://link.springer.com/chapter/10.1007/978-1-4612-1578-3_14), which relies on a sliding window. The size of this window may need to be reduced if you apply Gubbins to very small genomes (e.g. viruses).
Recombination is detected using a [spatial scanning statistic](https://link.springer.com/chapter/10.1007/978-1-4612-1578-3_14), which relies on a sliding window. The size of this window may need to be reduced if you apply Gubbins to very small genomes (e.g. viruses). To increase the sensitivity for detecting recombinations, `--min-snps` can be set at the minimum value of 2; the `--p-value` threshold required to detect recombinations can be increased; the `--trimming-ratio` can be raised above 1.0, to disfavour the trimming of recombination edges; and the `--extensive-search` mode can be used.

```
--min-snps MIN_SNPS, -m MIN_SNPS
Expand All @@ -179,19 +179,26 @@ Recombination is detected using a [spatial scanning statistic](https://link.spri
Minimum window size (default: 100)
--max-window-size MAX_WINDOW_SIZE, -b MAX_WINDOW_SIZE
Maximum window size (default: 10000)
--p-value P_VALUE Uncorrected p value used to identify recombinations (default: 0.05)
--trimming-ratio TRIMMING_RATIO
Ratio of log probabilities used to trim recombinations (default: 1.0)
--extensive-search Undertake slower, more thorough, search for recombination (default: False)
```

### Algorithm stop options
### Algorithm stop and restart options

Given the scale of available dataset sizes, and the size of tree space, it is unlikely that any Gubbins analysis will ever converge based on identifying identical trees in subsequent iterations. Note that trees from previous iterations are used as starting trees for inference in subsequent iterations with IQTree and RAxML (although not RAxML-NG). In practice, there is little improvement to the tree after three iterations.
Given the scale of available dataset sizes, and the size of tree space, it is unlikely that any Gubbins analysis will ever converge based on identifying identical trees in subsequent iterations. Normally the algorithm will stop after reaching the maximum number of iterations. Should the run fail or stall before this point, the analysis can be restarted from the last iteration that successfully completed by providing a tree through the `--resume` flag (all other flags should be kept identical to the original commend, including `--iterations`). Note that although only the tree is provided to `--resume`, the corresponding alignment generated at the end of the same iteration also needs to be available within the same directory.

```
--iterations ITERATIONS, -i ITERATIONS
Maximum No. of iterations (default: 5)
--converge-method {weighted_robinson_foulds,robinson_foulds,recombination}, -z {weighted_robinson_foulds,robinson_foulds,recombination}
Criteria to use to know when to halt iterations (default: weighted_robinson_foulds)
--resume RESUME Intermediate tree from previous run (must include "iteration_X" in file name) (default: None)
```

Note that trees from previous iterations are used as starting trees for inference in subsequent iterations with IQTree and RAxML (although not RAxML-NG).

## Output files

A successful Gubbins run will generate files with the suffixes:
Expand Down Expand Up @@ -221,13 +228,15 @@ The `.per_branch_statistics.csv` file contains summary statistics for each branc

* **Node** - Name of the node subtended by the branch. This can either be one of the taxa included in the input alignment, or an internal node, which are numbered
* **Total SNPs** - Total number of base substitutions reconstructed onto the branch
* **Num of SNPs inside recombinations** - Number of base substitutions reconstructed onto the branch that fall within a predicted recombination (*r*)
* **Num of SNPs outside recombinations** - Number of base substitutions reconstructed onto the branch that fall outside of a predicted recombination. i.e. predicted to have arisen by point mutation (*m*)
* **Num of Recombination Blocks** - Total number of recombination blocks reconstructed onto the branch
* **Bases in recombinations** - Total length of all recombination events reconstructed onto the branch
* **Number of SNPs Inside Recombinations** - Number of base substitutions reconstructed onto the branch that fall within a predicted recombination (*r*)
* **Number of SNPs Outside Recombinations** - Number of base substitutions reconstructed onto the branch that fall outside of a predicted recombination. i.e. predicted to have arisen by point mutation (*m*)
* **Number of Recombination Blocks** - Total number of recombination blocks reconstructed onto the branch
* **Bases in Recombinations** - Total length of all recombination events reconstructed onto the branch
* **Cumulative Bases in Recombinations** - Total number of bases in the alignment affected by recombination on this branch and its ancestors
* ***r/m*** - The r/m value for the branch. This value gives a measure of the relative impact of recombination and mutation on the variation accumulated on the branch
* ***rho/theta*** - The ratio of the number of recombination events to point mutations on a branch; a measure of the relative rates of recombination and point mutation
* **Genome Length** - The total number of aligned bases between the ancestral and descendent nodes for the branch excluding any missing data or gaps in either
* **Bases in Clonal Frame** - The number of called bases at the descendant node that have not been affected by recombination on this branch or an ancestor (i.e., the length of sequence that can be used for phylogenetic interpretation)

Note that all positions in the output files are relative to the input alignment. If you wish to compare the positions of recombinations relative to a reference annotation, their coordinates will need to be adjusted to account for any gaps in the reference sequence introduced when generating the alignment.

Expand Down
53 changes: 37 additions & 16 deletions python/gubbins/common.py
Original file line number Diff line number Diff line change
Expand Up @@ -108,11 +108,28 @@ def parse_and_run(input_args, program_description=""):
gaps_vcf_filename = base_filename + ".gaps.vcf"
joint_sequences_filename = base_filename + ".seq.joint.aln"

# If restarting from a previous run
starting_iteration = 1
if input_args.resume is not None:
search_itr = re.search(r'iteration_(\d+)', input_args.resume)
if search_itr is None:
sys.stderr.write('Resuming a Gubbins run requires a tree file name containing the phrase "iteration_X"\n')
exit(1)
else:
starting_iteration = int(search_itr.group(1)) + 1
if starting_iteration >= input_args.iterations:
sys.stderr.write('Run has already reached the number of specified iterations\n')
exit(1)
else:
sys.stderr.write('Resuming Gubbins analysis at iteration ' + str(starting_iteration) + '\n')
input_args.starting_tree = input_args.resume
current_tree_name = input_args.starting_tree

# Check if intermediate files from a previous run exist
intermediate_files = [basename + ".iteration_"]
if not input_args.no_cleanup:
if not input_args.no_cleanup and input_args.resume is None:
utils.delete_files(".", intermediate_files, "", input_args.verbose)
if utils.do_files_exist(".", intermediate_files, "", input_args.verbose):
if utils.do_files_exist(".", intermediate_files, "", input_args.verbose) and input_args.resume is None:
sys.exit("Intermediate files from a previous run exist. Please rerun without the --no_cleanup option "
"to automatically delete them or with the --use_time_stamp to add a unique prefix.")

Expand Down Expand Up @@ -176,11 +193,11 @@ def parse_and_run(input_args, program_description=""):
reconvert_fasta_file(gaps_alignment_filename, base_filename + ".start")
# Start the main loop
printer.print("\nEntering the main loop.")
for i in range(1, input_args.iterations+1):
for i in range(starting_iteration, input_args.iterations+1):
printer.print("\n*** Iteration " + str(i) + " ***")

# 1.1. Construct the tree-building command depending on the iteration and employed options
if i == 2:
if i == 2 or input_args.resume is not None:
# Select the algorithms used for the subsequent iterations
current_tree_builder, current_model_fitter, current_model, extra_tree_arguments, extra_model_arguments = return_algorithm_choices(input_args,i)
# Initialise tree builder
Expand Down Expand Up @@ -247,7 +264,7 @@ def parse_and_run(input_args, program_description=""):
# 3.2a. Joint ancestral reconstruction
printer.print(["\nReconstructing ancestral sequences with pyjar..."])

if i == 1:
if i == starting_iteration:

# 3.3a. Read alignment and identify unique base patterns in first iteration only

Expand Down Expand Up @@ -281,6 +298,7 @@ def parse_and_run(input_args, program_description=""):
info_filename = info_filename, # file containing evolutionary model parameters
info_filetype = input_args.model_fitter, # model fitter - format of file containing evolutionary model parameters
output_prefix = temp_working_dir + "/" + ancestral_sequence_basename, # output prefix
outgroup_name = input_args.outgroup, # outgroup for rooting and reconstruction
threads = input_args.threads, # number of cores to use
verbose = input_args.verbose,
max_pos = max_pos)
Expand Down Expand Up @@ -354,7 +372,8 @@ def parse_and_run(input_args, program_description=""):
shutil.copyfile(current_tree_name_with_internal_nodes, current_tree_name)
gubbins_command = create_gubbins_command(
gubbins_exec, gaps_alignment_filename, gaps_vcf_filename, current_tree_name,
input_args.alignment_filename, input_args.min_snps, input_args.min_window_size, input_args.max_window_size)
input_args.alignment_filename, input_args.min_snps, input_args.min_window_size, input_args.max_window_size,
input_args.p_value, input_args.trimming_ratio, input_args.extensive_search)
printer.print(["\nRunning Gubbins to detect recombinations...", gubbins_command])
try:
subprocess.check_call(gubbins_command, shell=True)
Expand Down Expand Up @@ -617,13 +636,16 @@ def return_algorithm(algorithm_choice, model, input_args, node_labels = None, ex
return initialised_algorithm

def create_gubbins_command(gubbins_exec, alignment_filename, vcf_filename, current_tree_name,
original_alignment_filename, min_snps, min_window_size, max_window_size):
original_alignment_filename, min_snps, min_window_size, max_window_size,
p_value, trimming_ratio, extensive_search):
command = [gubbins_exec, "-r", "-v", vcf_filename, "-a", str(min_window_size),
"-b", str(max_window_size), "-f", original_alignment_filename, "-t", current_tree_name,
"-m", str(min_snps), alignment_filename]
"-m", str(min_snps), "-p", str(p_value), "-i", str(trimming_ratio)]
if extensive_search:
command.append("-x")
command.append(alignment_filename)
return " ".join(command)


def number_of_sequences_in_alignment(filename):
return len(get_sequence_names_from_alignment(filename))

Expand Down Expand Up @@ -734,14 +756,13 @@ def reroot_tree(tree_name, outgroups):

def reroot_tree_with_outgroup(tree_name, outgroups):
clade_outgroups = get_monophyletic_outgroup(tree_name, outgroups)
outgroups = [{'name': taxon_name} for taxon_name in clade_outgroups]

tree = Phylo.read(tree_name, 'newick')
tree.root_with_outgroup(*outgroups)
Phylo.write(tree, tree_name, 'newick')

tree = dendropy.Tree.get_from_path(tree_name, 'newick', preserve_underscores=True)
tree.deroot()
outgroup_mrca = tree.mrca(taxon_labels=clade_outgroups)
print('Edge length is: ' + str(outgroup_mrca.edge.length))
tree.reroot_at_edge(outgroup_mrca.edge,
length1 = outgroup_mrca.edge.length/2,
length2 = outgroup_mrca.edge.length/2,
update_bipartitions=False)
tree.update_bipartitions()
output_tree_string = tree_as_string(tree, suppress_internal=False)
with open(tree_name, 'w+') as output_file:
Expand Down
Loading

0 comments on commit eb7b75e

Please sign in to comment.