Add GO-CAM stats & downloads to the pipeline #1180

Closed
lpalbou opened this issue Sep 23, 2019 · 12 comments

Comments

@lpalbou
Contributor

lpalbou commented Sep 23, 2019

This is a ticket to detail some under-the-hood processes and keep track of some proposals/requirements to be discussed and prioritized later.

  1. There were discussions about having GO-CAM stats computed and shown on the GO website (e.g. Tighter integration and access of GO-CAMs in view of the article release geneontology.github.io#180) or for general QCs. Those should be stored with the other GO stats recently delivered.

  2. In addition, the GO-CAM downloads are handled for the moment by my secondary pipeline which should also be merged to the main go pipeline - speaking of the red links:

[Screenshot from 2019-09-23 showing the red links]

  3. Similarly, since the public-facing GO-CAM site fetches data from a snapshot that changes only with each release, my scripts precompute some JSON files at each release (e.g.1 & e.g.2) for fast loading of the browse page. Those files could also be stored in the release or, probably better, be deprecated in favor of the GO-CAM indexing in GOLr.
@dustine32
Contributor

  1. Total count of GO-CAMs
  2. Count of GO-CAMs having at least 3 activities connected through causal relationships

For 2, we will use this query.

@lpalbou
Contributor Author

lpalbou commented Sep 7, 2021

Some details discussed by email:

The query to fetch all the GO-CAMs having at least 3 activities connected through causal relationships is this one:
https://api.triplydb.com/s/m-c4NFkOe
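
The linked SPARQL query is not reproduced here. As a rough illustration of the criterion only, the following sketch counts models whose largest group of causally linked activities has at least 3 members. It assumes that "connected" means belonging to the same connected component, and the input layout (`activities`, `causal_edges`) is hypothetical, not the triplestore schema.

```python
from collections import defaultdict


def largest_causal_component(activities, causal_edges):
    """Size of the largest set of activities linked by causal edges.

    Edges are treated as undirected for connectivity (an assumption).
    """
    nodes = set(activities)
    parent = {node: node for node in nodes}

    def find(node):
        # Union-find lookup with path halving.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for source, target in causal_edges:
        if source in parent and target in parent:
            parent[find(source)] = find(target)

    sizes = defaultdict(int)
    for node in nodes:
        sizes[find(node)] += 1
    return max(sizes.values(), default=0)


def count_connected_models(models, min_activities=3):
    """Count models with at least `min_activities` causally connected activities."""
    return sum(
        1
        for model in models.values()
        if largest_causal_component(model["activities"], model["causal_edges"])
        >= min_activities
    )
```

A model with three activities but only one causal edge would not be counted under this reading, since its largest connected group has two members.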

Dustin, you could create a method to call that SPARQL query from there: go_stats.py#L132

This would modify the go-stats.json.

Then, for these stats to appear in the go-annotation-changes.json, you may need to alter part of this code: go-stats/go_annotation_changes.py#L8, as well as this one, go_annotation_changes.py#L180, which is used to create the text/tab report that Pascale checks before the release.

If we want these stats to be used from the GO website, then they also have to be added to the go-stats-summary.json (loaded on the front page to show the stats on the top right). You can pick up the stats to add to that file here: go-stats/go_reports.py#L229
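
A minimal sketch of the flow described above: add the GO-CAM counts to the stats dictionary, then compute the difference between two releases for the changes report. The key names (`gocams`, `total`, `causally_connected`) and both function names are hypothetical; the actual structure of go-stats.json and the code in go_stats.py may differ.

```python
def add_gocam_stats(stats, total, causally_connected):
    """Return a copy of `stats` with a GO-CAM section added (hypothetical keys)."""
    updated = dict(stats)
    updated["gocams"] = {
        "total": total,
        "causally_connected": causally_connected,
    }
    return updated


def gocam_changes(current, previous):
    """Per-key difference (current minus previous) of the GO-CAM section."""
    cur = current.get("gocams", {})
    prev = previous.get("gocams", {})
    return {
        key: cur.get(key, 0) - prev.get(key, 0)
        for key in sorted(set(cur) | set(prev))
    }
```

The same `gocams` section could then be copied into the summary file if the numbers should appear on the website.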

@dustine32
Contributor

From 2021-09-14 Alliance Pathways call, add condition to "Total count of GO-CAMs" query to select only modelstate=="production" models.

@kltm
Member

kltm commented Sep 14, 2021

Shouldn't only "production" be available on the prod SPARQL endpoint?

@lpalbou
Contributor Author

lpalbou commented Sep 14, 2021

Just be careful not to select the _inferred GO-CAMs in your query, but yes, those will only be production models from that triple store.
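
One possible way to apply that exclusion after fetching results, assuming the inferred graphs can be recognized by an identifier ending in `_inferred`. That naming rule is an assumption made for illustration, and the function name is hypothetical.

```python
def exclude_inferred(graph_ids, suffix="_inferred"):
    """Drop identifiers that end with `suffix` (assumed marker of inferred graphs)."""
    return [graph_id for graph_id in graph_ids if not graph_id.endswith(suffix)]
```

The equivalent filter could also be placed in the SPARQL query itself, which would avoid transferring the excluded rows.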

@dustine32
Contributor

> Shouldn't only "production" be available on the prod SPARQL endpoint?

@kltm Ah yes, you are correct, thanks! Forgot that that is part of producing the endpoint triplestore. Shouldn't be an issue then.

@lpalbou Thanks for the heads up about excluding the "_inferred" models! @kltm Would these models also already get excluded from the production triplestore? I couldn't find any containing "inferred" in the title.

@kltm
Member

kltm commented Sep 15, 2021

So, if I'm following here (tagging in @balhoff), there may be two types of models in a store: "real" models (noctua-generated and imports) and GAF-derived. The latter are likely to be uninteresting (with imported models being a separate interesting case--they may need to be marked). I guess the idea would be to filter those GAF-derived ones out; it might be worthwhile to look at their creation to see how they can be easily filtered.

@kltm changed the title from "Add GO-CAM stats & downloads to the pipeline [to discuss & prioritize]" to "Add GO-CAM stats & downloads to the pipeline" on Sep 22, 2021
@dustine32
Contributor

From 2021-11-05 slack #developers discussion:

The Alliance site's gene page pathway viewer has a GO-CAMs tab with a number that's currently computed on the fly by a call to the GO-CAM API, which then queries the GO production RDF triplestore:
[image]
This means that the GO triplestore endpoint gets queried anytime an Alliance site user opens any gene page.

To reduce the number of calls to the GO triplestore, we could just precompute this number (or better yet a GP_ID gomodelid model_title TSV) at GO release time and cache it somewhere for use, either directly by the Alliance site or by refactoring the GO-CAM API to pull from this cache. A caveat to this "snapshot" approach is that the GO-CAM API also currently pulls data for rendering the model from the live modelstore via barista. So we'd likely see data sync issues (e.g. a model existed at GO release but was deleted in Noctua a week later) unless we also cached all data required for rendering during GO release.
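
A sketch of what writing such a TSV at release time might look like. The column names come from the note above; the function name and the shape of `rows` (tuples of gene product ID, model ID, model title) are assumptions for illustration.

```python
import csv


def write_gp_model_tsv(rows, handle):
    """Write (GP_ID, gomodelid, model_title) rows as TSV to an open text handle."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["GP_ID", "gomodelid", "model_title"])
    # Sort so that the file is stable from one release to the next.
    for gp_id, model_id, title in sorted(rows):
        writer.writerow([gp_id, model_id, title])
```

The count shown on a gene page would then be the number of rows for that GP_ID, with no query to the triplestore at page load.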

Adding this brainstorming note here since it's likely the code area where we would be implementing.

Tagging @kltm

@lpalbou
Contributor Author

lpalbou commented Nov 5, 2021

Hi Dustin. If you want to precompute, I would suggest starting from all the genes in GO-CAMs (fewer than in the Alliance) and creating a dict { GP1 -> [model1, model2] , GP2 -> [model3] ... }. That file could indeed be updated at every release and used by the GO-CAM API, since the goal was to show only publicly released models.
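
A minimal sketch of building that dict from (gene product, model) pairs. The function name and the input shape are hypothetical; how the pairs are fetched from the triplestore is left out.

```python
from collections import defaultdict


def build_gp_index(pairs):
    """Map each gene product ID to a sorted list of the distinct models it appears in."""
    index = defaultdict(set)
    for gp_id, model_id in pairs:
        index[gp_id].add(model_id)
    return {gp_id: sorted(models) for gp_id, models in sorted(index.items())}
```

The result is a plain dict of lists, so it can be written out with `json.dump` at each release.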

The out-of-sync issue is a good point though... I wonder if GO-CAMs couldn't be published every month as .json files on release.geneontology.org? The S3 could then serve as the source of data and it would be in sync with the cached API. It could help users get access to GO-CAMs as well, especially if the JSON is already in a format structured around activities?

Have a good weekend :)

@dustine32
Contributor

Ah thanks so much! It definitely helps to get your confirmation here.

An activity-centric JSON format standard, ready for external users to consume, would be a good way to handle the caching aspect here. As we develop this, we can invent format versions, similar to the GPAD/GAF specs, and then update tools (like gocam-viz) to handle the differences. Definitely "project-able".

@lpalbou
Contributor Author

lpalbou commented Nov 6, 2021

Exactly, then the viewer could be just a viewer and external users would have a simple file to work on. Tagging @cmungall as he had some ideas on the structure of such a GO-CAM file, more oriented toward PPIs.

@kltm
Member

kltm commented Feb 23, 2022

Discussing w/ @pgaudet, we'll revisit this fresh in a new issue in a new project.

@kltm kltm closed this as completed Feb 23, 2022