Explore SGD's pre-import and post-import GPADs to understand the origin of the 10% annotation increase #271

Closed
kltm opened this issue Feb 14, 2024 · 16 comments
@kltm
Member

kltm commented Feb 14, 2024

Explore SGD's pre-import and post-import GPADs to understand the origin of the 10% annotation increase. The theory here is that it is likely that a small number of ontology issues could explain the difference.

The best files for comparison at this moment (while the pipeline creates a more recent batch that should only differ in an ORCID fix) are at http://skyhook.berkeleybop.org/confinement-for-noctua-models-271/ . Namely:

http://skyhook.berkeleybop.org/confinement-for-noctua-models-271/pre_import_sgd.gpad (GPAD 2.0, before minerva ingest)
~http://skyhook.berkeleybop.org/confinement-for-noctua-models-271/post_import_sgd.gpad (GPAD 1.1, export from minerva)

See lower down at: #271 (comment)
(Also see geneontology/minerva#539)

@kltm kltm added the question label Feb 14, 2024
@kltm
Member Author

kltm commented Feb 14, 2024

I would encourage people to start with the files above, as I think I made a mistake.
Just personally poking around, looking through the col4s, sorting, uniqing, and diffing them:

cat pre_import_sgd.gpad | grep -v '^!' | cut -f 4 | sort | uniq > pre_terms.txt
cat post_import_sgd.gpad | grep -v '^!' | cut -f 4 | sort | uniq > post_terms.txt
diff pre_terms.txt post_terms.txt | grep '>' | cut -f 2 -d ' ' > exclusive_post_terms.txt 
diff pre_terms.txt post_terms.txt | grep '<' | cut -f 2 -d ' ' > exclusive_pre_terms.txt 

Results:
http://skyhook.berkeleybop.org/confinement-for-noctua-models-271/exclusive_pre_terms.txt (terms that exclusively appear before the import)
http://skyhook.berkeleybop.org/confinement-for-noctua-models-271/exclusive_post_terms.txt (terms that exclusively appear after the import)
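The same column-4 comparison can also be sketched with `comm`, which avoids parsing `diff` output. This is a minimal sketch, not what was run for the files above; the function names `terms` and `exclusive_terms` are hypothetical helpers introduced here for illustration.

```shell
# terms FILE: print the distinct column-4 values of a GPAD file,
# skipping header lines that start with '!'.
terms() {
  grep -v '^!' "$1" | cut -f 4 | LC_ALL=C sort -u
}

# exclusive_terms A B: print terms that occur in A but not in B.
# comm needs both inputs sorted with the same collation, hence LC_ALL=C.
exclusive_terms() {
  a_terms=$(mktemp)
  b_terms=$(mktemp)
  terms "$1" > "$a_terms"
  terms "$2" > "$b_terms"
  # -2 drops lines unique to B, -3 drops lines common to both
  LC_ALL=C comm -23 "$a_terms" "$b_terms"
  rm -f "$a_terms" "$b_terms"
}

# Usage:
#   exclusive_terms pre_import_sgd.gpad post_import_sgd.gpad > exclusive_pre_terms.txt
#   exclusive_terms post_import_sgd.gpad pre_import_sgd.gpad > exclusive_post_terms.txt
```

Swapping the two arguments gives the other direction, so one helper covers both result files.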

@kltm
Member Author

kltm commented Feb 14, 2024

@cmungall
Member

It's not surprising that the two files are different, because the post file includes models that were created de novo in Noctua and were never imported into Protein2GO.

For example, this one:
http://noctua.geneontology.org/editor/graph/gomodel:600ced8500001127?

Was manually created by Stacia in Noctua in 2021.

So this is naturally in the complete set of SGD annotations that get exported from Noctua. However, if the goal is to compare pre and post, then exactly the same annotations need to be compared, and models that were manually curated in Noctua need to be subtracted out first.

@kltm
Member Author

kltm commented Feb 15, 2024

A filter can be added then by subtracting the noctua_sgd.gpad set from the current release (http://current.geneontology.org/products/upstream_and_raw_data/noctua_sgd-src.gpad.gz), which would give the "addition set".
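One way such a filter could look, as a rough sketch only: it assumes that both files carry a `noctua-model-id=gomodel:...` entry in their properties column (as the export lines quoted later in this thread do) and drops every export line whose model id also occurs in the Noctua-native file. The function name `subtract_models` is hypothetical.

```shell
# subtract_models EXPORT NATIVE: print the EXPORT lines whose noctua-model-id
# does not occur in NATIVE. Header lines ('!') are skipped in both files.
# Lines without a model id are kept, since they cannot be attributed to a model.
subtract_models() {
  awk '
    /^!/ { next }
    {
      id = ""
      if (match($0, /noctua-model-id=gomodel:[^|\t ]*/))
        id = substr($0, RSTART, RLENGTH)
    }
    # first file operand: remember the model ids of Noctua-native annotations
    FILENAME == ARGV[1] { if (id != "") native[id] = 1; next }
    # second file operand: keep only lines from models not seen above
    !(id in native)
  ' "$2" "$1"
}

# Usage (file names are placeholders):
#   subtract_models post_import_sgd.gpad noctua_sgd-src.gpad > post_without_native.gpad
```

Matching on the model id rather than on whole lines is a design choice: whole-line subtraction would miss annotations whose date or properties changed between the two exports.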

@cmungall
Member

cmungall commented Feb 15, 2024

There are also some other odd things going on with timing.

@kltm can you provide more of a timeline for these files?

  • when was the minerva ingest
  • what is the URL for the file that was ingested
  • when was the export

Here is GO:0043044 in the post file:

grep GO:0043044 post_import_sgd.gpad
SGD	S000000966	enables	GO:0140658	PMID:33174727	ECO:0000314			20210706	SGD	part_of(GO:0043044)	contributor=https://orcid.org/0000-0001-5472-917X|noctua-model-id=gomodel:600ced8500001127|model-state=production
SGD	S000000966	involved_in	GO:0043044	PMID:33174727	ECO:0000314			20210706	SGD		contributor=https://orcid.org/0000-0001-5472-917X|noctua-model-id=gomodel:600ced8500001127|model-state=production

This comes from the Noctua-native model I mentioned above. Note the annotation is to GO:0043044, which was merged into GO:0006338 in 2021.

The current model has been successfully migrated:

http://noctua.geneontology.org/download/gomodel:600ced8500001127/gpad

I believe the model file was migrated to the new term some time ago (on a flight with very slow wifi, so I can't check), but the history should be here: https://github.com/geneontology/noctua-models/blob/master/models/600ced8500001127.ttl

so how is it possible an annotation to GO:0043044 made its way into the post file?
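To trace a stale term like this back to its models, the `grep` above can be extended to list the model ids involved. A minimal sketch, assuming the properties column carries `noctua-model-id=gomodel:...` as in the two lines quoted above; `models_for_term` is a hypothetical helper name.

```shell
# models_for_term TERM FILE: print the distinct model ids of all lines in FILE
# that mention TERM anywhere (term column or extension).
models_for_term() {
  grep -v '^!' "$2" \
    | grep -F "$1" \
    | grep -o 'noctua-model-id=gomodel:[^|[:space:]]*' \
    | cut -d = -f 2 \
    | LC_ALL=C sort -u
}

# Usage:
#   models_for_term GO:0043044 post_import_sgd.gpad
```

Each model id that comes back can then be checked against its history in noctua-models to see when the term was last touched.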

@cmungall
Member

cmungall commented Feb 15, 2024

Here is the commit, which was in April 2022:
7725dbe

This replaced GO:0043044 with GO:0006338.

@kltm
Member Author

kltm commented Feb 15, 2024

A lot of scrambling around in the last week, so let's soft reset this all:

  • we'll find the gpad that was used to produce the imported TTLs (@dustine32, when you have a chance)
  • (we'll confirm that they seem to be related as we expect)
  • I'll build a local blazegraph with these TTLs only and output a new GPAD with this blazegraph; this new GPAD should be as close as possible to the original, with as little noise as possible, plus the inferences
  • we start comparing again from scratch

(we took some shortcuts as we didn't have some proper test runs, so wrong files may have been grabbed; Suzi's initial finds, however, were from the pseudo-GPAD produced by the GPAD output workbench on the newly live models, and our initial focus was at that end)

@dustine32
Contributor

the gpad that was used to produce the imported TTLs

@kltm Here is the permalink to the file in GH:
https://raw.githubusercontent.com/geneontology/sgd-go-cams/bda1b9d21b830f91882081b29a8f5f0b07fbc631/products/go_cam_sgd_valid.gpad

@kltm
Member Author

kltm commented Feb 16, 2024

Okay, resetting this all, I've created a new export file with:

sh ./local/src/git/minerva/minerva-cli/bin/minerva-cli.sh --import-owl-models -f ~/local/src/git/sgd-go-cams/models -j /tmp/blazegraph.jnl
mkdir -p /tmp/legacy/gpad && MINERVA_CLI_MEMORY=8G ./local/src/git/minerva/minerva-cli/bin/minerva-cli.sh --lego-to-gpad-sparql --ontology https://current.geneontology.org/ontology/extensions/go-lego.owl --ontojournal ontojournal.jnl -i /tmp/blazegraph.jnl --gpad-output /tmp/legacy/gpad
cat /tmp/legacy/gpad/*.gpad | grep -v '^!' > /tmp/sgd_export.gpad

The pre and post import files are now at:
http://skyhook.berkeleybop.org/confinement-for-noctua-models-271/sgd_import.gpad
http://skyhook.berkeleybop.org/confinement-for-noctua-models-271/sgd_export.gpad

@kltm
Member Author

kltm commented Feb 16, 2024

cat sgd_import.gpad | grep -v '^!' | cut -f 4 | sort | uniq > pre_terms.txt
cat sgd_export.gpad | grep -v '^!' | cut -f 4 | sort | uniq > post_terms.txt
diff pre_terms.txt post_terms.txt | grep '>' | cut -f 2 -d ' ' > exclusive_post_terms.txt 
diff pre_terms.txt post_terms.txt | grep '<' | cut -f 2 -d ' ' > exclusive_pre_terms.txt 

@kltm
Member Author

kltm commented Feb 16, 2024

wc -l *.gpad
   55197 sgd_export.gpad
   50616 sgd_import.gpad

@pgaudet

pgaudet commented Feb 16, 2024

Hi @kltm
Thanks for sharing these.

The EXPORT file contains 4,583 more lines than the IMPORT file.
There are 2,055 GO terms in extensions in the import file, and the unfolding step of the GO pipeline instantiates these as annotations, so as far as I can tell the 'real' diff is 2,528. Inferences account for only 29 additional annotations (see below).
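The extension count and the subtraction can be sketched as follows. This is an illustration rather than the method actually used for the figures above: it assumes the annotation extensions sit in tab-separated column 11 (adjustable through the second argument), and the names `count_extension_terms` and `real_diff` are hypothetical.

```shell
# count_extension_terms FILE [COLUMN]: count GO ids in the extension column.
# COLUMN defaults to 11, which is an assumption about the file layout.
count_extension_terms() {
  grep -v '^!' "$1" | cut -f "${2:-11}" | grep -o 'GO:[0-9]\{7\}' | wc -l | tr -d ' '
}

# real_diff LINE_DIFF EXTENSION_TERMS: line difference left over once the
# unfolded extension terms are accounted for.
real_diff() {
  echo $(( $1 - $2 ))
}

# Usage with the numbers quoted in this comment:
#   count_extension_terms sgd_import.gpad
#   real_diff 4583 2055    # prints 2528
```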

Calculation of inferences
There are only 10 IDs that I find in the EXPORT file that are not anywhere in the IMPORT file; here are the GO terms, with the number of occurrences in the EXPORT file:

| GO ID | Label | Count in EXPORT file |
|---|---|---|
| GO:0022414 | reproductive process | 14 |
| GO:0042918 | alkanesulfonate transport | 1 |
| GO:0042959 | alkanesulfonate transmembrane transporter activity | 1 |
| GO:0061425 | positive regulation of ethanol catabolic process by positive regulation of transcription from RNA polymerase II promoter | 2 |
| GO:0071705 | nitrogen compound transport | 1 |
| GO:0072337 | modified amino acid transport | 3 |
| GO:1900068 | negative regulation of cellular response to alkaline pH | 3 |
| GO:1900070 | negative regulation of cellular hyperosmotic salinity response | 2 |
| GO:1900072 | positive regulation of sulfite transport | 1 |
| GO:1903047 | mitotic cell cycle process | 1 |

The 'worst' one is reproductive process, but I removed the logical definition, so this one should go away.
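A tally like the one above could be reproduced along these lines. It is a sketch under assumptions, not the procedure actually used: it treats "anywhere in the IMPORT file" as any GO id on any non-header line, takes the annotated term from column 4 of the export, and uses the hypothetical helper name `inferred_term_counts`. The counts listed above sum to 29, which matches the figure given for inferences.

```shell
# inferred_term_counts IMPORT EXPORT: for every column-4 term of EXPORT that is
# not mentioned anywhere in IMPORT, print "TERM<TAB>COUNT" (one row per term).
inferred_term_counts() {
  known=$(mktemp)
  terms=$(mktemp)
  # every GO id the import mentions, including those inside extensions
  grep -v '^!' "$1" | grep -o 'GO:[0-9]\{7\}' | LC_ALL=C sort -u > "$known"
  # annotated term of every export line, one per annotation
  grep -v '^!' "$2" | cut -f 4 > "$terms"
  awk '
    FILENAME == ARGV[1] { seen[$0] = 1; next }
    !($0 in seen)       { n[$0]++ }
    END                 { for (t in n) print t "\t" n[t] }
  ' "$known" "$terms" | LC_ALL=C sort
  rm -f "$known" "$terms"
}

# Usage:
#   inferred_term_counts sgd_import.gpad sgd_export.gpad
```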

@cmungall
Member

These reflect broader problems with GO rather than anything to do with the process per se.

Let's take nitrogen compound transport. What an awful, useless term. If not obsoleted, it should at least have a do-not-annotate (which should block propagation). There are 78894 chemical entities, big and small, classified under CHEBI:51143. I'm surprised we have direct annotations to it, and even more surprised to see IBAs.

If we look at the source publication for many of these
https://amigo.geneontology.org/amigo/reference/PMID:24842606

We see that the curator clearly wanted the more useful protoporphyrin transport, but we didn't have this, so they chose the most specific term available and then added an extension. Then, when it gets propagated by IBA, only the useless nitrogen compound transport is propagated.

I know this seems a bit off topic, but this is common in GO: we attack the symptoms rather than the causes, which is much more expensive.

Others are coming from F->P. E.g.

id: GO:0042959
name: alkanesulfonate transmembrane transporter activity
...
relationship: part_of GO:0042918 {http://purl.org/dc/terms/source="GO_REF:0000090"} ! alkanesulfonate transport

This will be addressed by the current refactoring.

So overall my opinion is that these are minor additions that are valid yet trivial and will eventually disappear as the ontology improves. Of course, we must still decouple inference from conversion.

@suzialeksander

@suzialeksander will make ontology tickets for the remaining terms to be tagged do_not_annotate

@suzialeksander

The origin of the increase seems to be identified. SGD is manually fixing some annotations that didn't make the move, but after the above ticket there don't seem to be any other additional annotations, just "unfolded" extensions.
