Roary grouping genes that don't meet similarity threshold (using `-i 100`, and `-s`)
#616
Comments
I have the same problem: I detected different protein variants in the same Roary group, despite using -i 100 -s...
Thanks for the comment, glad to see I'm not the only person having this issue. From my preliminary testing it seems like Roary only starts working "correctly" when you go down to a 98% threshold. I'm currently working on a tool that independently validates Roary's results (as I need to use it for a paper but obviously don't fully trust it anymore). I'd be happy to share that once it's complete.
I also tried Panaroo and saw the same problem: proteins with >98% identity were pooled in the same family, despite -i 100...
For sure, will share once it's finished (hopefully in the next couple of weeks). In the meantime, have you created a GitHub issue on the Panaroo page? It looks like they're pretty responsive, as the program is still being maintained/developed. (I'll probably run my data through it too, but it might take a bit to verify I'm having the issue on my end.)
Hi LijMeh,
Just to add for the record:
I've been running Roary in order to use the presence/absence tables as input for training ML models (to identify sub-species groups based on shared genomic features). As I don't really care if genes are paralogs, I included option `-s`, but, because I do care if the proteins differ by even one amino acid, I set `-i 100`.

My understanding is that this should cause the output from Roary to capture all variants of every gene as their own "cluster" (while ignoring the synteny of the gene), allowing me to determine which strains have which unique variants. However, in practice, I'm frequently seeing genes grouped into clusters that aren't 100% identical (via blastp).
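The expectation described above (one amino acid sequence per cluster when `-i 100` is set) can also be checked without BLAST by comparing the sequences as plain strings. The following is only a minimal sketch, assuming the proteins of a single cluster have already been exported as FASTA text; `parse_fasta`, `distinct_variants`, and the example IDs and sequences are hypothetical and not part of Roary.

```python
from collections import defaultdict


def parse_fasta(text):
    """Parse FASTA text into {sequence id: sequence} (id = first word of header)."""
    parts_by_id = {}
    current_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current_id = line[1:].split()[0]
            parts_by_id[current_id] = []
        elif current_id is not None:
            parts_by_id[current_id].append(line)
    # Upper-case and drop a trailing stop symbol so formatting differences
    # are not reported as sequence differences.
    return {
        seq_id: "".join(parts).upper().rstrip("*")
        for seq_id, parts in parts_by_id.items()
    }


def distinct_variants(records):
    """Group sequence ids by exact amino acid sequence."""
    groups = defaultdict(list)
    for seq_id, seq in records.items():
        groups[seq].append(seq_id)
    return dict(groups)


# Made-up cluster: two identical proteins and one single-residue variant.
example = """>strainA_00012
MKTAYIAKQR
>strainB_00455
MKTAYIAKQR
>strainC_00031
MKTAYIVKQR
"""

variants = distinct_variants(parse_fasta(example))
print(f"{len(variants)} distinct sequence(s) in this cluster")
for seq, ids in variants.items():
    print(seq, ids)
```

A cluster that truly satisfies a 100% identity threshold should yield exactly one group here; more than one group means the cluster mixes variants.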
I've verified this by pulling genes out using the `query_pan_genome` function, by manually extracting genes based on their location in the `.gff` files, and by nucleotide blast followed by translating to a protein.

If anyone has any ideas of why this might be the case, that would be extremely helpful, as, for the genes that aren't grouping correctly, I can't seem to find any logic as to why they might be clustered together (based on my inputs).
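For anyone wanting to repeat the manual check, extracting a gene by its `.gff` coordinates and translating it can be sketched roughly as below. This is an illustration under stated assumptions, not the exact procedure used above: the contig sequence is assumed to be already loaded as a string, `extract_cds` and `translate` are hypothetical helper names, and the codon table is the standard one (alternative bacterial start codons such as GTG or TTG are not converted to M, which can itself produce a spurious first-residue mismatch).

```python
# Standard genetic code, built from the usual NCBI "TCAG" ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_TABLE = dict(zip(CODONS, AMINO_ACIDS))

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def extract_cds(contig, start, end, strand):
    """Return the coding sequence for a feature.

    GFF coordinates are 1-based and inclusive; features on the '-' strand
    are reverse-complemented.
    """
    seq = contig[start - 1:end].upper()
    if strand == "-":
        seq = seq.translate(COMPLEMENT)[::-1]
    return seq


def translate(cds):
    """Translate a coding sequence, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(cds) - 2, 3):
        amino_acid = CODON_TABLE.get(cds[i:i + 3], "X")  # X = ambiguous codon
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Made-up contig with one small gene at positions 3..17 on the '+' strand.
contig = "CCATGAAATTTGGGTAACC"
forward_protein = translate(extract_cds(contig, 3, 17, "+"))

# The same gene as it would appear if the contig were stored reverse-complemented.
flipped_contig = extract_cds(contig, 1, len(contig), "-")
reverse_protein = translate(extract_cds(flipped_contig, 3, 17, "-"))

print(forward_protein, reverse_protein)
```

Proteins obtained this way can then be compared directly as strings, which avoids any ambiguity about how an alignment tool reports identity.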
As an aside, the only thing I've found that seems to correlate with this pattern (but not perfectly) is that proteins that are mis-grouped occasionally have similar reading frame directions (i.e. `+` or `-` in the `.gff` file), so one group might end up being filled with proteins that are generally read on the `-` strand, and another that are generally read on the `+` strand (but never a 100% split).
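The strand observation could be quantified per cluster with something like the sketch below. It assumes GFF3-style, tab-separated annotation lines with an `ID=` attribute on each CDS feature (as Prokka-style output typically has); `strand_by_gene_id`, `strand_counts`, and the example records are hypothetical.

```python
from collections import Counter


def strand_by_gene_id(gff_text):
    """Map each CDS feature's ID attribute to its strand ('+' or '-')."""
    strands = {}
    for line in gff_text.splitlines():
        if line.startswith("##FASTA"):
            break  # embedded sequence section follows; no more features
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != 9 or fields[2] != "CDS":
            continue
        attributes = dict(
            item.split("=", 1) for item in fields[8].split(";") if "=" in item
        )
        if "ID" in attributes:
            strands[attributes["ID"]] = fields[6]
    return strands


def strand_counts(cluster_gene_ids, strands):
    """Count how many genes of one cluster lie on each strand."""
    return Counter(strands[g] for g in cluster_gene_ids if g in strands)


# Made-up annotation with three CDS features.
gff_example = "\n".join([
    "##gff-version 3",
    "contig1\tProdigal\tCDS\t10\t99\t.\t+\t0\tID=gene_0001;product=hypothetical protein",
    "contig1\tProdigal\tCDS\t150\t449\t.\t-\t0\tID=gene_0002;product=hypothetical protein",
    "contig2\tProdigal\tCDS\t5\t304\t.\t-\t0\tID=gene_0003;product=hypothetical protein",
    "##FASTA",
    ">contig1",
    "ACGT",
])

strands = strand_by_gene_id(gff_example)
counts = strand_counts(["gene_0001", "gene_0002", "gene_0003"], strands)
print(counts)
```

Running this over every cluster and comparing the `+`/`-` split between suspect clusters and clean ones would show whether the correlation holds up or is coincidental.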