Explore SGD's pre-import and post-import GPADs to understand the origin of the 10% annotation increase #271
I would encourage people to start with the files above, as I think I made a mistake.
Results:
It's not surprising that the two files are different, because the post file includes models that were created de novo in Noctua and were never imported into Protein2GO. For example, this one was manually created by Stacia in Noctua in 2021, so it is naturally in the complete set of SGD annotations that get exported from Noctua. However, if the goal is to compare pre and post, then the exact same annotations need to be compared, and models that were manually curated in Noctua need to be subtracted out for comparison purposes.
A filter can then be added by subtracting the noctua_sgd.gpad set from the current release (http://current.geneontology.org/products/upstream_and_raw_data/noctua_sgd-src.gpad.gz), which would give the "addition set".
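The subtraction described above could be sketched roughly as follows. This is only an illustration, not the pipeline's actual filter: the function names are hypothetical, and it assumes both inputs are tab-separated GPAD files in the same version with `!` header lines. The default `key_columns` (database, object ID, GO term, reference, evidence, counted from 0 in a GPAD 1.x layout) are an assumption and should be adjusted to whatever defines "the same annotation" for the comparison.

```python
from typing import Iterable, Sequence, Set, Tuple


def annotation_keys(lines: Iterable[str], key_columns: Sequence[int]) -> Set[Tuple[str, ...]]:
    """Reduce GPAD lines to a set of comparable keys (hypothetical helper)."""
    keys = set()
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and '!' header/comment lines.
        if not line or line.startswith("!"):
            continue
        fields = line.split("\t")
        # Missing trailing columns are treated as empty strings.
        keys.add(tuple(fields[i] if i < len(fields) else "" for i in key_columns))
    return keys


def addition_set(
    post_lines: Iterable[str],
    noctua_lines: Iterable[str],
    key_columns: Sequence[int] = (0, 1, 3, 4, 5),  # assumed GPAD 1.x layout
) -> Set[Tuple[str, ...]]:
    """Keys present in the post-import file but not in the Noctua-native set."""
    return annotation_keys(post_lines, key_columns) - annotation_keys(noctua_lines, key_columns)
```

With real files, the two arguments would be open file handles (for example `gzip.open(path, "rt")` for the `.gz` release file), since the functions only iterate over lines.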
There are also some other odd things going on with timing. @kltm, can you provide more of a timeline for these files?
Here is GO:0043044 in the post file:
This comes from the Noctua-native model I mentioned above. Note that the annotation is to GO:0043044, which was merged into GO:0006338 in 2021. The current model has been successfully migrated (http://noctua.geneontology.org/download/gomodel:600ced8500001127/gpad). I believe the model file was migrated to the new term some time ago (I'm on a flight with very slow wifi, so I can't check), but the history should be here: https://github.com/geneontology/noctua-models/blob/master/models/600ced8500001127.ttl So how is it possible that an annotation to GO:0043044 made its way into the post file?
Here is the commit, which was in April. This replaced GO:0043044 with GO:0006338.
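A quick way to check whether a merged term such as GO:0043044 still appears in a given GPAD file is to scan for it as a whole tab-separated field. The sketch below is a minimal illustration with a hypothetical function name; it assumes `!` header lines and tab-separated columns, and it deliberately does not look inside annotation extensions, where the term would be embedded in a longer string.

```python
from typing import Iterable, List, Tuple


def lines_with_term(lines: Iterable[str], term_id: str) -> List[Tuple[int, str]]:
    """Return (line number, line) pairs whose columns include term_id exactly."""
    hits = []
    for number, line in enumerate(lines, start=1):
        # Header/comment lines in GPAD start with '!'.
        if line.startswith("!"):
            continue
        stripped = line.rstrip("\n")
        # Compare whole fields so that, e.g., GO:00430440 would not match.
        if term_id in stripped.split("\t"):
            hits.append((number, stripped))
    return hits
```

Running this over both the pre and post files for the same term would show on which side of the import the stale annotation first appears.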
A lot of scrambling around in the last week, so let's soft reset this all:
(We took some shortcuts because we didn't have proper test runs, so the wrong files may have been grabbed. Suzi's initial finds, however, were from the pseudo-GPAD produced by the GPAD output workbench on the newly live models, and our initial focus was at that end.)
@kltm Here is the permalink to the file in GH:
Okay, resetting this all, I've created a new export file with:
The pre- and post-import files are now at:
Hi @kltm. The EXPORT file contains 4,583 more lines than the IMPORT file.
Calculation of inferences:
The 'worst' one is reproductive process, but I removed the logical definition, so this should go.
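To see which terms account for most of a line-count difference like this, one option is to tally annotations per term in each file and rank the increases. The sketch below is illustrative only: the function names are hypothetical, and it assumes the term ID sits in the fourth tab-separated column (index 3) of both files, which should be verified against the actual GPAD versions being compared.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def term_counts(lines: Iterable[str], term_column: int = 3) -> Counter:
    """Count annotation lines per term ID (term_column is an assumption)."""
    counts: Counter = Counter()
    for line in lines:
        # Skip blank lines and '!' header/comment lines.
        if not line.strip() or line.startswith("!"):
            continue
        fields = line.rstrip("\n").split("\t")
        if term_column < len(fields):
            counts[fields[term_column]] += 1
    return counts


def largest_increases(
    pre_lines: Iterable[str],
    post_lines: Iterable[str],
    pre_term_column: int = 3,
    post_term_column: int = 3,
    top: int = 10,
) -> List[Tuple[str, int]]:
    """Terms whose annotation count grew from pre to post, largest first."""
    pre = term_counts(pre_lines, pre_term_column)
    post = term_counts(post_lines, post_term_column)
    # Counter returns 0 for terms absent from the pre file.
    delta = Counter({term: post[term] - pre[term] for term in post if post[term] > pre[term]})
    return delta.most_common(top)
```

Terms that appear only in the post file show up with their full count, which makes newly inferred annotations easy to spot.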
These reflect broader problems with GO rather than anything to do with the process per se. Let's take: If we look at the source publication for many of these: We see that the curator clearly wanted the more useful: I know this seems a bit off topic, but this is common in GO, where we attach the symptoms rather than the causes, which is much more expensive. Others are coming from F->P. E.g.
This will be addressed by the current refactoring. So overall, my opinion is that these are minor additions that are valid yet trivial and will eventually disappear as the ontology improves. Of course, we must still decouple inference from conversion.
@suzialeksander will make ontology tickets for the remaining terms to be tagged.
The origin of the increase seems to be identified. SGD is manually fixing some annotations that didn't make the move, but after the above ticket there don't seem to be other additional annotations, just "unfolded" extensions.
Explore SGD's pre-import and post-import GPADs to understand the origin of the 10% annotation increase. The theory here is that a small number of ontology issues could likely explain the difference.

The best files for comparison at this moment (while the pipeline creates a more recent batch that should only differ in an ORCID fix) are at http://skyhook.berkeleybop.org/confinement-for-noctua-models-271/ . Namely:
- http://skyhook.berkeleybop.org/confinement-for-noctua-models-271/pre_import_sgd.gpad (GPAD 2.0, before minerva ingest)
- http://skyhook.berkeleybop.org/confinement-for-noctua-models-271/post_import_sgd.gpad (GPAD 1.1, export from minerva)

See lower down at: #271 (comment)
(Also see geneontology/minerva#539)