Roary grouping genes that don't meet similarity threshold (using `-i 100`, and `-s`) #616

LijMeh · 2024-03-19T13:53:17Z

I've been running Roary in order to use the presence/absence tables as input for training ML models (to identify sub-species groups based on shared genomic features). As I don't really care if genes are paralogs, I included option -s, but, because I do care if the proteins differ by even one amino acid, I set -i 100.

My understanding is that this should cause the output from Roary to capture all variants of every gene as their own "cluster" (while ignoring the synteny of the gene), allowing me to determine which strains have which unique variants. However, in practice, I'm frequently seeing genes grouped into clusters that aren't 100% identical (via blastp).

I've verified this by pulling genes out using the query_pan_genome function, by manually extracting genes based on their location in the .gff files, and by nucleotide blast followed by translating to a protein.
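
A check like the one described above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure actually used in this thread: the function names `parse_fasta` and `mixed_clusters`, the toy sequences, and the cluster names are all hypothetical, and it assumes you have already collected the protein sequences (as FASTA) and a mapping from each Roary cluster to its gene IDs (for example from `gene_presence_absence.csv`).

```python
from collections import defaultdict


def parse_fasta(text):
    """Parse FASTA text into {record_id: sequence}.

    The record ID is taken to be the first word of the header line.
    """
    records = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:].split()[0]
            records[current] = []
        elif current is not None:
            records[current].append(line)
    # Normalise case and drop a trailing stop-codon marker, if present.
    return {k: "".join(v).upper().rstrip("*") for k, v in records.items()}


def mixed_clusters(clusters, proteins):
    """Return {cluster: {sequence: [gene_ids]}} for every cluster that
    contains more than one distinct protein sequence."""
    flagged = {}
    for name, gene_ids in clusters.items():
        by_seq = defaultdict(list)
        for gid in gene_ids:
            by_seq[proteins[gid]].append(gid)
        if len(by_seq) > 1:
            flagged[name] = dict(by_seq)
    return flagged


# Toy data (hypothetical): g3 differs from g1/g2 by one amino acid.
fasta = """>g1
MKT
>g2
MKT
>g3
MKS
>g4
MAA
"""
proteins = parse_fasta(fasta)
clusters = {"group_1": ["g1", "g2", "g3"], "group_2": ["g4"]}
flagged = mixed_clusters(clusters, proteins)
for name, variants in flagged.items():
    print(name, variants)
```

Exact string comparison is used here because `-i 100` is expected to separate proteins that differ by even one residue; any cluster reported by `mixed_clusters` would contradict that expectation.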

If anyone has any ideas about why this might be the case, that would be extremely helpful. For the genes that aren't grouping correctly, I can't seem to find any logic as to why they might be clustered together (based on my inputs).

As an aside, the only thing I've found that seems to correlate with this pattern (but not perfectly) is that mis-grouped proteins occasionally share the same strand (i.e., + or - in the .gff file), so one group might end up filled with proteins that are mostly read on the - strand, and another with proteins that are mostly read on the + strand (but never a 100% split).


memoriasresiduais · 2024-04-17T15:15:40Z

I have the same problem: I could detect different protein variants in the same Roary group, despite using -i 100 -s...

LijMeh · 2024-04-17T23:54:21Z

Thanks for the comment, glad to see I'm not the only person having this issue. From my preliminary testing it seems like Roary only starts working "correctly" when you go down to a 98% threshold. I'm currently working on a tool that independently validates Roary's results (as I need to use it for a paper but obviously don't fully trust it anymore). I'd be happy to share that once it's complete.

memoriasresiduais · 2024-04-18T08:37:42Z

I also tried Panaroo and had the same problem: proteins with >98% identity are pooled in the same family, despite -i 100...
Happy to try your tool for validation!

LijMeh · 2024-04-18T14:06:31Z

For sure, will share once it's finished (hopefully in the next couple of weeks).

In the meantime, have you created a GitHub issue on the Panaroo page? It looks like they're pretty responsive as the program is still being maintained/developed. (I'll probably run my data through it too, but might take a bit to verify I'm having the issue on my end)

memoriasresiduais · 2024-04-18T14:56:56Z

Hi LijMeh,
I just tried PanACoTA and it seems to be pooling my gene families properly using a cut-off of 100% aa identity (-i 1) (https://aperrin.pages.pasteur.fr/pipeline_annotation/html-doc/usage.html#pangenome-subcommand)
I'll do more checks though to be sure.
feel free to reach me directly, if you wish
good luck,
m

memoriasresiduais · 2024-04-19T15:22:58Z

Just to add for the record:
in my Roary analyses with -i 100 -s, Roary was not only pooling different protein variants in the same Roary group, but also placing identical proteins in different groups...
no idea why this happens.
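
The second symptom (identical proteins landing in different groups) can be screened for with a similar sketch. Again this is only an illustration under stated assumptions: `split_identical` is a hypothetical helper, the data are toy values, and the two dictionaries (cluster to gene IDs, gene ID to protein sequence) are assumed to have been built from the Roary output beforehand.

```python
from collections import defaultdict


def split_identical(clusters, proteins):
    """Return {sequence: sorted cluster names} for every protein sequence
    that appears in more than one cluster."""
    seen = defaultdict(set)
    for name, gene_ids in clusters.items():
        for gid in gene_ids:
            seen[proteins[gid]].add(name)
    return {seq: sorted(names) for seq, names in seen.items() if len(names) > 1}


# Toy data (hypothetical): g1 and g2 are identical but sit in different groups.
proteins = {"g1": "MKT", "g2": "MKT", "g3": "MKS"}
clusters = {"group_1": ["g1"], "group_2": ["g2", "g3"]}
split = split_identical(clusters, proteins)
for seq, names in split.items():
    print(seq, "found in", names)
```

Note that with `-s` paralogs are not split, so under the expectation described in this thread an identical sequence should not be spread across groups; any entry returned here is worth inspecting by hand.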

memoriasresiduais mentioned this issue Apr 18, 2024

cut-off of 100% aa identity not working properly gtonkinhill/panaroo#286

Closed

memoriasresiduais mentioned this issue May 15, 2024

missing CDSs in gene_presence_absence.csv? #367

Closed
