Roary grouping genes that don't meet similarity threshold (using -i 100, and -s) #616

Open
LijMeh opened this issue Mar 19, 2024 · 6 comments

@LijMeh

LijMeh commented Mar 19, 2024

I've been running Roary in order to use the presence/absence tables as input for training ML models (to identify sub-species groups based on shared genomic features). As I don't really care if genes are paralogs, I included option -s, but, because I do care if the proteins differ by even one amino acid, I set -i 100.

My understanding is that this should cause the output from Roary to capture all variants of every gene as their own "cluster" (while ignoring the synteny of the gene), allowing me to determine which strains have which unique variants. However, in practice, I'm frequently seeing proteins grouped into the same cluster that aren't 100% identical (as checked via blastp).

I've verified this by pulling genes out using the query_pan_genome function, by manually extracting genes based on their location in the .gff files, and by nucleotide blast followed by translating to a protein.
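For anyone who wants to reproduce this kind of check, here is a minimal sketch in Python. It assumes the cluster membership is available as text with one line per group in the form `group_name: id1<TAB>id2 ...` (which is how I understand Roary's `clustered_proteins` file to be laid out, but treat that as an assumption), and that the protein sequences are in a FASTA file keyed by the same IDs. The function names are hypothetical, not part of Roary.

```python
def parse_fasta(text):
    """Return {sequence_id: sequence} from FASTA-formatted text."""
    seqs = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the ID.
            current = line[1:].split()[0]
            seqs[current] = []
        elif current is not None:
            seqs[current].append(line)
    # Join wrapped lines, drop a trailing stop symbol, normalise case.
    return {k: "".join(v).rstrip("*").upper() for k, v in seqs.items()}


def parse_clusters(text):
    """Parse lines of the ASSUMED form 'group_name: id1<TAB>id2 ...'."""
    clusters = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        name, _, members = line.partition(": ")
        clusters[name.strip()] = members.split()
    return clusters


def mixed_clusters(clusters, proteins):
    """Return {cluster_name: number of distinct sequences} for clusters
    whose members are not all identical at the amino-acid level."""
    mixed = {}
    for name, members in clusters.items():
        variants = {proteins[m] for m in members if m in proteins}
        if len(variants) > 1:
            mixed[name] = len(variants)
    return mixed
```

Any cluster returned by `mixed_clusters` contains at least two proteins that differ by exact string comparison, which is a stricter test than a 100% identity hit from blastp over a partial alignment.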

If anyone has any ideas as to why this might be the case, that would be extremely helpful. For the genes that aren't grouping correctly, I can't find any logic (based on my inputs) as to why they would be clustered together.

As an aside, the only thing I've found that seems to correlate with this pattern (but not perfectly) is strand: mis-grouped proteins occasionally share the same coding strand (i.e. + or - in the .gff file), so one group might end up filled mostly with proteins encoded on the - strand and another mostly with proteins encoded on the + strand (but never a 100% split).
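A rough way to quantify that strand pattern is to tally the strand column per cluster. The sketch below assumes GFF3 input where each CDS row carries an `ID=` attribute matching the IDs used in the clusters, and that any embedded sequences follow a `##FASTA` marker (as in Prokka-style files). The `clusters` argument is a plain `{cluster_name: [id, ...]}` dict, and both function names are hypothetical.

```python
from collections import Counter


def strands_from_gff(gff_text):
    """Return {feature_id: '+' or '-'} for CDS rows of GFF3 text."""
    strands = {}
    for line in gff_text.splitlines():
        if line.startswith("##FASTA"):
            break  # sequences follow this marker; no more feature rows
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 9 or cols[2] != "CDS":
            continue
        # Column 9 holds 'key=value' pairs separated by ';'.
        attrs = dict(
            field.split("=", 1) for field in cols[8].split(";") if "=" in field
        )
        if "ID" in attrs:
            strands[attrs["ID"]] = cols[6]  # column 7 is the strand
    return strands


def strand_counts(clusters, strands):
    """Return {cluster_name: Counter of strands} for members with a known strand."""
    return {
        name: Counter(strands[m] for m in members if m in strands)
        for name, members in clusters.items()
    }
```

Comparing the counts for clusters that look wrong against the counts for clusters that look right would show whether the strand bias is real or just an impression from a handful of examples.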

@memoriasresiduais

I have the same problem: I could detect different protein variants in the same Roary group, despite using -i 100 -s...

@LijMeh
Author

LijMeh commented Apr 17, 2024

Thanks for the comment, glad to see I'm not the only person having this issue. From my preliminary testing it seems like Roary only starts working "correctly" when you go down to a 98% threshold. I'm currently working on a tool that independently validates Roary's results (as I need to use it for a paper but obviously don't fully trust it anymore). I'd be happy to share that once it's complete.

@memoriasresiduais

I also tried Panaroo and saw the same problem: proteins with >98% identity were pooled in the same family, despite -i 100....
Happy to try your tool for validation!

@LijMeh
Author

LijMeh commented Apr 18, 2024

For sure, will share once it's finished (hopefully in the next couple of weeks).

In the meantime, have you created a GitHub issue on the Panaroo page? It looks like they're pretty responsive, as the program is still being maintained/developed. (I'll probably run my data through it too, but it might take a bit to verify whether I'm seeing the issue on my end.)

@memoriasresiduais

Hi LijMeh,
I just tried PanACoTA and it seems to be pooling my gene families properly using a cut-off of 100% aa identity (-i 1) (https://aperrin.pages.pasteur.fr/pipeline_annotation/html-doc/usage.html#pangenome-subcommand).
I'll do more checks though to be sure.
Feel free to reach me directly, if you wish.
good luck,
m
[email protected]

@memoriasresiduais

Just to add for the record:
in my Roary analyses with -i 100 -s, Roary was not only pooling different protein variants in the same group, but also placing identical proteins in different groups...
No idea why this happens.
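For the second symptom (identical proteins ending up in different groups), a minimal sketch of a check is to index clusters by exact sequence and report any sequence that appears under more than one group name. It assumes cluster membership and protein sequences have already been loaded into plain dicts; the function name is hypothetical.

```python
from collections import defaultdict


def split_identical(clusters, proteins):
    """Return {sequence: sorted cluster names} for sequences found in >1 cluster.

    clusters: {cluster_name: [protein_id, ...]}
    proteins: {protein_id: amino-acid sequence}
    """
    groups_by_seq = defaultdict(set)
    for name, members in clusters.items():
        for member in members:
            if member in proteins:
                groups_by_seq[proteins[member]].add(name)
    return {
        seq: sorted(names)
        for seq, names in groups_by_seq.items()
        if len(names) > 1
    }
```

With -s set, paralogs are not supposed to be split, so any sequence reported here is a case worth attaching to the issue along with the protein IDs involved.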
