Merge pull request #338 from nickjcroucher/update_rec

v3.2.1
nickjcroucher · May 24, 2022 · eb7b75e · eb7b75e
2 parents a9d5dcd + 5f80ed7
commit eb7b75e
Show file tree

Hide file tree

Showing 56 changed files with 3,184 additions and 2,464 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,8 +1,19 @@
 # Change Log
 
-## [v3.1.6](https://github.com/sanger-pathogens/gubbins/tree/v3.1.6) (2021-1-20)
+## [v3.2.1](https://github.com/sanger-pathogens/gubbins/tree/v3.2.1) (2022-5-24)
+[Full Changelog](https://github.com/sanger-pathogens/gubbins/compare/v3.1.6...v3.2.1)
+
+- Fix problem with sequence reconstruction
+- Improve detection of small recombinations by modifying window sizes
+- Enable resumption of stalled analyses
+- Clean C code
+- Fixes to scripts
+- Add CI tests and update expected results
+
+## [v3.1.6](https://github.com/sanger-pathogens/gubbins/tree/v3.1.6) (2022-1-20)
 [Full Changelog](https://github.com/sanger-pathogens/gubbins/compare/v3.1.5...v3.1.6)
 
+
 - Fix problem with sequence reconstruction
 - Add test for consistency of reconstructions
 

diff --git a/README.md b/README.md
@@ -105,7 +105,7 @@ chmod +x configure
 make
 sudo make install
 cd python
-python setup.py install
+python3 -m pip install .
 ```
 
 ### OSX/Linux/Windows - Virtual Machine

diff --git a/VERSION b/VERSION
@@ -1 +1 @@
-3.1.6
+3.2.1
diff --git a/docs/gubbins_manual.md b/docs/gubbins_manual.md
@@ -170,7 +170,7 @@ Gubbins was originally designed to use a [joint ancestral state reconstruction](
 
 ### Recombination detection options
 
-Recombination is detected using a [spatial scanning statistic](https://link.springer.com/chapter/10.1007/978-1-4612-1578-3_14), which relies on a sliding window. The size of this window may need to be reduced if you apply Gubbins to very small genomes (e.g. viruses).
+Recombination is detected using a [spatial scanning statistic](https://link.springer.com/chapter/10.1007/978-1-4612-1578-3_14), which relies on a sliding window. The size of this window may need to be reduced if you apply Gubbins to very small genomes (e.g. viruses). To increase the sensitivity for detecting recombinations, `--min-snps` can be set at the minimum value of 2; the `--p-value` threshold required to detect recombinations can be increased; the `--trimming-ratio` can be raised above 1.0, to disfavour the trimming of recombination edges; and the `--extensive-search` mode can be used.
 
 ```
   --min-snps MIN_SNPS, -m MIN_SNPS
@@ -179,19 +179,26 @@ Recombination is detected using a [spatial scanning statistic](https://link.spri
                         Minimum window size (default: 100)
   --max-window-size MAX_WINDOW_SIZE, -b MAX_WINDOW_SIZE
                         Maximum window size (default: 10000)
+  --p-value P_VALUE     Uncorrected p value used to identify recombinations (default: 0.05)
+  --trimming-ratio TRIMMING_RATIO
+                        Ratio of log probabilities used to trim recombinations (default: 1.0)
+  --extensive-search    Undertake slower, more thorough, search for recombination (default: False)
 ```
 
-### Algorithm stop options
+### Algorithm stop and restart options
 
-Given the scale of available dataset sizes, and the size of tree space, it is unlikely that any Gubbins analysis will ever converge based on identifying identical trees in subsequent iterations. Note that trees from previous iterations are used as starting trees for inference in subsequent iterations with IQTree and RAxML (although not RAxML-NG). In practice, there is little improvement to the tree after three iterations.
+Given the scale of available dataset sizes, and the size of tree space, it is unlikely that any Gubbins analysis will ever converge based on identifying identical trees in subsequent iterations. Normally the algorithm will stop after reaching the maximum number of iterations. Should the run fail or stall before this point, the analysis can be restarted from the last iteration that successfully completed by providing a tree through the `--resume` flag (all other flags should be kept identical to the original commend, including `--iterations`). Note that although only the tree is provided to `--resume`, the corresponding alignment generated at the end of the same iteration also needs to be available within the same directory.
 
 ```
   --iterations ITERATIONS, -i ITERATIONS
                         Maximum No. of iterations (default: 5)
   --converge-method {weighted_robinson_foulds,robinson_foulds,recombination}, -z {weighted_robinson_foulds,robinson_foulds,recombination}
                         Criteria to use to know when to halt iterations (default: weighted_robinson_foulds)
+  --resume RESUME       Intermediate tree from previous run (must include "iteration_X" in file name) (default: None)
 ```
 
+Note that trees from previous iterations are used as starting trees for inference in subsequent iterations with IQTree and RAxML (although not RAxML-NG).
+
 ## Output files
 
 A successful Gubbins run will generate files with the suffixes:
@@ -221,13 +228,15 @@ The `.per_branch_statistics.csv` file contains summary statistics for each branc
 
 * **Node** - Name of the node subtended by the branch. This can either be one of the taxa included in the input alignment, or an internal node, which are numbered 
 * **Total SNPs** - Total number of base substitutions reconstructed onto the branch
-* **Num of SNPs inside recombinations** - Number of base substitutions reconstructed onto the branch that fall within a predicted recombination (*r*)
-* **Num of SNPs outside recombinations** - Number of base substitutions reconstructed onto the branch that fall outside of a predicted recombination. i.e. predicted to have arisen by point mutation (*m*)
-* **Num of Recombination Blocks** - Total number of recombination blocks reconstructed onto the branch
-* **Bases in recombinations** - Total length of all recombination events reconstructed onto the branch
+* **Number of SNPs Inside Recombinations** - Number of base substitutions reconstructed onto the branch that fall within a predicted recombination (*r*)
+* **Number of SNPs Outside Recombinations** - Number of base substitutions reconstructed onto the branch that fall outside of a predicted recombination. i.e. predicted to have arisen by point mutation (*m*)
+* **Number of Recombination Blocks** - Total number of recombination blocks reconstructed onto the branch
+* **Bases in Recombinations** - Total length of all recombination events reconstructed onto the branch
+* **Cumulative Bases in Recombinations** - Total number of bases in the alignment affected by recombination on this branch and its ancestors
 * ***r/m*** - The r/m value for the branch. This value gives a measure of the relative impact of recombination and mutation on the variation accumulated on the branch
 * ***rho/theta*** - The ratio of the number of recombination events to point mutations on a branch; a measure of the relative rates of recombination and point mutation
 * **Genome Length** - The total number of aligned bases between the ancestral and descendent nodes for the branch excluding any missing data or gaps in either
+* **Bases in Clonal Frame** - The number of called bases at the descendant node that have not been affected by recombination on this branch or an ancestor (i.e., the length of sequence that can be used for phylogenetic interpretation)
 
 Note that all positions in the output files are relative to the input alignment. If you wish to compare the positions of recombinations relative to a reference annotation, their coordinates will need to be adjusted to account for any gaps in the reference sequence introduced when generating the alignment.
 

diff --git a/python/gubbins/common.py b/python/gubbins/common.py
@@ -108,11 +108,28 @@ def parse_and_run(input_args, program_description=""):
     gaps_vcf_filename = base_filename + ".gaps.vcf"
     joint_sequences_filename = base_filename + ".seq.joint.aln"
 
+    # If restarting from a previous run
+    starting_iteration = 1
+    if input_args.resume is not None:
+        search_itr = re.search(r'iteration_(\d+)', input_args.resume)
+        if search_itr is None:
+            sys.stderr.write('Resuming a Gubbins run requires a tree file name containing the phrase "iteration_X"\n')
+            exit(1)
+        else:
+            starting_iteration = int(search_itr.group(1)) + 1
+            if starting_iteration >= input_args.iterations:
+                sys.stderr.write('Run has already reached the number of specified iterations\n')
+                exit(1)
+            else:
+                sys.stderr.write('Resuming Gubbins analysis at iteration ' + str(starting_iteration) + '\n')
+            input_args.starting_tree = input_args.resume
+            current_tree_name = input_args.starting_tree
+
     # Check if intermediate files from a previous run exist
     intermediate_files = [basename + ".iteration_"]
-    if not input_args.no_cleanup:
+    if not input_args.no_cleanup and input_args.resume is None:
         utils.delete_files(".", intermediate_files, "", input_args.verbose)
-    if utils.do_files_exist(".", intermediate_files, "", input_args.verbose):
+    if utils.do_files_exist(".", intermediate_files, "", input_args.verbose) and input_args.resume is None:
         sys.exit("Intermediate files from a previous run exist. Please rerun without the --no_cleanup option "
                  "to automatically delete them or with the --use_time_stamp to add a unique prefix.")
 
@@ -176,11 +193,11 @@ def parse_and_run(input_args, program_description=""):
     reconvert_fasta_file(gaps_alignment_filename, base_filename + ".start")
     # Start the main loop
     printer.print("\nEntering the main loop.")
-    for i in range(1, input_args.iterations+1):
+    for i in range(starting_iteration, input_args.iterations+1):
         printer.print("\n*** Iteration " + str(i) + " ***")
 
         # 1.1. Construct the tree-building command depending on the iteration and employed options
-        if i == 2:
+        if i == 2 or input_args.resume is not None:
             # Select the algorithms used for the subsequent iterations
             current_tree_builder, current_model_fitter, current_model, extra_tree_arguments, extra_model_arguments = return_algorithm_choices(input_args,i)
             # Initialise tree builder
@@ -247,7 +264,7 @@ def parse_and_run(input_args, program_description=""):
             # 3.2a. Joint ancestral reconstruction
             printer.print(["\nReconstructing ancestral sequences with pyjar..."])
 
-            if i == 1:
+            if i == starting_iteration:
 
                 # 3.3a. Read alignment and identify unique base patterns in first iteration only
 
@@ -281,6 +298,7 @@ def parse_and_run(input_args, program_description=""):
                 info_filename = info_filename, # file containing evolutionary model parameters
                 info_filetype = input_args.model_fitter, # model fitter - format of file containing evolutionary model parameters
                 output_prefix = temp_working_dir + "/" + ancestral_sequence_basename, # output prefix
+                outgroup_name = input_args.outgroup, # outgroup for rooting and reconstruction
                 threads = input_args.threads, # number of cores to use
                 verbose = input_args.verbose,
                 max_pos = max_pos)
@@ -354,7 +372,8 @@ def parse_and_run(input_args, program_description=""):
         shutil.copyfile(current_tree_name_with_internal_nodes, current_tree_name)
         gubbins_command = create_gubbins_command(
             gubbins_exec, gaps_alignment_filename, gaps_vcf_filename, current_tree_name,
-            input_args.alignment_filename, input_args.min_snps, input_args.min_window_size, input_args.max_window_size)
+            input_args.alignment_filename, input_args.min_snps, input_args.min_window_size, input_args.max_window_size,
+            input_args.p_value, input_args.trimming_ratio, input_args.extensive_search)
         printer.print(["\nRunning Gubbins to detect recombinations...", gubbins_command])
         try:
             subprocess.check_call(gubbins_command, shell=True)
@@ -617,13 +636,16 @@ def return_algorithm(algorithm_choice, model, input_args, node_labels = None, ex
     return initialised_algorithm
 
 def create_gubbins_command(gubbins_exec, alignment_filename, vcf_filename, current_tree_name,
-                           original_alignment_filename, min_snps, min_window_size, max_window_size):
+                           original_alignment_filename, min_snps, min_window_size, max_window_size,
+                           p_value, trimming_ratio, extensive_search):
     command = [gubbins_exec, "-r", "-v", vcf_filename, "-a", str(min_window_size),
                "-b", str(max_window_size), "-f", original_alignment_filename, "-t", current_tree_name,
-               "-m", str(min_snps), alignment_filename]
+               "-m", str(min_snps), "-p", str(p_value), "-i", str(trimming_ratio)]
+    if extensive_search:
+            command.append("-x")
+    command.append(alignment_filename)
     return " ".join(command)
 
-
 def number_of_sequences_in_alignment(filename):
     return len(get_sequence_names_from_alignment(filename))
 
@@ -734,14 +756,13 @@ def reroot_tree(tree_name, outgroups):
 
 def reroot_tree_with_outgroup(tree_name, outgroups):
     clade_outgroups = get_monophyletic_outgroup(tree_name, outgroups)
-    outgroups = [{'name': taxon_name} for taxon_name in clade_outgroups]
-
-    tree = Phylo.read(tree_name, 'newick')
-    tree.root_with_outgroup(*outgroups)
-    Phylo.write(tree, tree_name, 'newick')
-
     tree = dendropy.Tree.get_from_path(tree_name, 'newick', preserve_underscores=True)
-    tree.deroot()
+    outgroup_mrca = tree.mrca(taxon_labels=clade_outgroups)
+    print('Edge length is: ' + str(outgroup_mrca.edge.length))
+    tree.reroot_at_edge(outgroup_mrca.edge,
+                        length1 = outgroup_mrca.edge.length/2,
+                        length2 = outgroup_mrca.edge.length/2,
+                        update_bipartitions=False)
     tree.update_bipartitions()
     output_tree_string = tree_as_string(tree, suppress_internal=False)
     with open(tree_name, 'w+') as output_file: