Add GO-CAM stats & downloads to the pipeline #1180

lpalbou · 2019-09-23T23:05:49Z

This is a ticket to detail some under-the-hood processes and keep track of some proposals/requirements to be discussed and prioritized later.

There were discussions about computing GO-CAM stats and showing them on the GO website (e.g. "Tighter integration and access of GO-CAMs in view of the article release", geneontology.github.io#180) or using them for general QC. Those stats should be stored with the other GO stats recently delivered.
In addition, the GO-CAM downloads are currently handled by my secondary pipeline, which should also be merged into the main GO pipeline. I am referring to the red links:

Similarly, since the public-facing GO-CAM site fetches data from a snapshot that changes only with each release, my scripts precompute some JSON files at each release (e.g.1 & e.g.2) for fast loading of the browse page. Those files could also be stored in the release or, probably better, be deprecated in favor of the GO-CAM indexing in GOLr.

dustine32 · 2021-09-07T18:21:16Z

1. Total count of GO-CAMs
2. Count of GO-CAMs having at least 3 activities connected through causal relationships

For 2, we will use this query.

lpalbou · 2021-09-07T18:21:46Z

Some details discussed by email:

The query to fetch all the GO-CAMs having at least 3 activities connected through causal relationships is this one:
https://api.triplydb.com/s/m-c4NFkOe

Dustin, you could create a method to call that SPARQL query from there: go_stats.py#L132

This would modify go-stats.json.
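A minimal sketch of what such a method could look like, assuming the endpoint returns standard SPARQL 1.1 JSON results. The function names, the `gocam` variable name and the `gocams` / `all` / `causal` keys are hypothetical and are not taken from go_stats.py; fetching the response over HTTP is left out so that the counting logic can be checked offline.

```python
def count_bindings(sparql_json, var="gocam"):
    """Count the distinct values bound to `var` in a SPARQL 1.1 JSON result.

    `var` is an assumed variable name; it has to match the SELECT clause
    of the query that is actually run.
    """
    bindings = sparql_json.get("results", {}).get("bindings", [])
    return len({b[var]["value"] for b in bindings if var in b})


def add_gocam_stats(stats, all_models_json, causal_models_json):
    """Return a copy of the stats dict with a (hypothetical) 'gocams' section."""
    updated = dict(stats)
    updated["gocams"] = {
        # Stat 1: total count of GO-CAMs
        "all": count_bindings(all_models_json),
        # Stat 2: GO-CAMs with at least 3 causally connected activities
        "causal": count_bindings(causal_models_json),
    }
    return updated
```

Counting distinct values rather than rows guards against a query that returns one row per activity instead of one row per model.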

Then, for these stats to appear in go-annotation-changes.json, you may need to alter part of this code: go-stats/go_annotation_changes.py#L8, as well as go_annotation_changes.py#L180, which is used to create the text/tab report that Pascale checks before the release.

If we want these stats to be used by the GO website, then they also have to be added to go-stats-summary.json (loaded on the front page to show the stats at the top right). You can pick the stats to add to that file here: go-stats/go_reports.py#L229


dustine32 · 2021-09-14T18:23:49Z

From the 2021-09-14 Alliance Pathways call: add a condition to the "Total count of GO-CAMs" query to select only models with modelstate == "production".

kltm · 2021-09-14T20:05:13Z

Shouldn't only "production" be available on the prod SPARQL endpoint?

lpalbou · 2021-09-14T20:12:29Z

Just be careful not to select the _inferred GO-CAMs in your query. But yes, that triple store will only contain production models.
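One way this exclusion could be sketched, assuming that "_inferred" shows up as a suffix on the graph IRI (an assumption, not something confirmed in this thread). The helper names and the example IRIs are made up for illustration. `STRENDS` is a standard SPARQL 1.1 function, so the same condition can be pushed into the query itself.

```python
# Assumed convention: inferred graphs carry an "_inferred" suffix on their IRI.
INFERRED_SUFFIX = "_inferred"

# SPARQL 1.1 fragment that could be appended inside a WHERE block.
# ?gocam is a hypothetical variable name.
EXCLUDE_INFERRED_FILTER = 'FILTER(!STRENDS(STR(?gocam), "_inferred"))'


def is_asserted_model(graph_iri):
    """True when the graph IRI does not look like an inferred graph."""
    return not graph_iri.endswith(INFERRED_SUFFIX)


def drop_inferred(graph_iris):
    """Keep only the asserted model graphs, preserving input order."""
    return [iri for iri in graph_iris if is_asserted_model(iri)]
```

For example, `drop_inferred(["http://example.org/modelA", "http://example.org/modelA_inferred"])` keeps only the first IRI.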

dustine32 · 2021-09-14T23:53:57Z

> Shouldn't only "production" be available on the prod SPARQL endpoint?

@kltm Ah yes, you are correct, thanks! Forgot that that is part of producing the endpoint triplestore. Shouldn't be an issue then.

@lpalbou Thanks for the heads up about excluding the "_inferred" models! @kltm Would these models also already get excluded from the production triplestore? I couldn't find any containing "inferred" in the title.

kltm · 2021-09-15T00:59:45Z

So, if I'm following here (tagging in @balhoff), there may be two types of models in a store: "real" models (noctua-generated and imports) and GAF-derived. The latter are likely to be uninteresting (with imported models being a separate interesting case--they may need to be marked). I guess the idea would be to filter those GAF-derived ones out; it might be worthwhile to look at their creation to see how they can be easily filtered.

dustine32 · 2021-11-05T23:06:50Z

From the 2021-11-05 Slack #developers discussion:

The Alliance site's gene page pathway viewer has a GO-CAMs tab with a number that's currently computed on the fly by a call to the GO-CAM API, which then queries the GO production RDF triplestore:

This means that the GO triplestore endpoint gets queried anytime an Alliance site user opens any gene page.

To reduce the number of calls to the GO triplestore, we could precompute this number (or better yet, a TSV with GP_ID, gomodelid and model_title columns) at GO release time and cache it somewhere, to be used either directly by the Alliance site or by a GO-CAM API refactored to pull from this cache. A caveat to this "snapshot" approach is that the GO-CAM API also currently pulls the data for rendering the model from the live modelstore via barista. So we would likely see data sync issues (e.g. a model existed at GO release but was deleted in Noctua a week later) unless we also cached all the data required for rendering during the GO release.

Adding this brainstorming note here since it's likely the code area where we would be implementing.

Tagging @kltm

lpalbou · 2021-11-05T23:33:36Z

Hi Dustin. If you want to precompute, I would suggest starting from all the genes in GO-CAMs (fewer than in the Alliance) and creating a dict { GP1 -> [model1, model2], GP2 -> [model3], ... }. That file could indeed be updated at every release and used by the GO-CAM API, since the goal was to show only publicly released models.
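A rough sketch of building that dict, assuming the input is the three-column TSV (GP_ID, gomodelid, model_title) mentioned earlier in the thread, with no header row. The function name and the shape of each entry are hypothetical.

```python
import csv
import io
import json
from collections import defaultdict


def build_gp_index(tsv_text):
    """Map each gene product ID to the list of GO-CAMs it appears in.

    Expects tab-separated rows of: GP_ID, gomodelid, model_title.
    Short or empty rows are skipped, and duplicate (GP, model) pairs are
    collapsed.
    """
    index = defaultdict(list)
    seen = set()
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        if len(row) < 3:
            continue
        gp_id, model_id, title = row[0], row[1], row[2]
        if (gp_id, model_id) in seen:
            continue
        seen.add((gp_id, model_id))
        index[gp_id].append({"id": model_id, "title": title})
    return dict(index)


def to_json(index):
    """Serialize the index deterministically so release diffs stay small."""
    return json.dumps(index, indent=2, sort_keys=True)
```

With such a file cached at release time, the count shown on a gene page would be `len(index.get(gp_id, []))`, with no call to the triplestore.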

The out-of-sync issue is a good point though... I wonder whether GO-CAMs could be published every month as .json files on release.geneontology.org? The S3 bucket could then serve as the source of data, and it would be in sync with the cached API. It could also help users get access to GO-CAMs, especially if the JSON is already in a format structured around activities.

Have a good weekend :)

dustine32 · 2021-11-05T23:47:49Z

Ah thanks so much! It definitely helps to get your confirmation here.

An activity-centric JSON format standard, ready for external users to consume, would be a good way to handle the caching aspect here. As we develop this, we can invent format versions, similar to the GPAD/GAF specs, and then update tools (like gocam-viz) to handle the differences. Definitely "project-able".

lpalbou · 2021-11-06T12:14:53Z

Exactly, then the viewer could be just a viewer, and external users would have a simple file to work with. Tagging @cmungall as he had some ideas on the structure of such a GO-CAM file, oriented more toward PPIs.

kltm · 2022-02-23T19:43:06Z

After discussing with @pgaudet: we'll revisit this fresh in a new issue in a new project.

lpalbou added enhancement question revisit labels Sep 23, 2019

dustine32 added a commit to geneontology/go-stats that referenced this issue Sep 8, 2021

WIP - SPARQL caller for GO-CAM stats - geneontology/go-site#1180

a39e8e9

dustine32 added a commit to geneontology/go-stats that referenced this issue Sep 9, 2021

Finished passing gocams fields to reports; updated networkx - geneontology/go-site#1180

1577d6e

dustine32 mentioned this issue Sep 9, 2021

Add GO-CAM stats geneontology/go-stats#18

Open

3 tasks

dustine32 added a commit to geneontology/go-stats that referenced this issue Sep 10, 2021

Fill in missing gocam stats in prev stats - geneontology/go-site#1180

c79be47

kltm changed the title ~~Add GO-CAM stats & downloads to the pipeline [to discuss & prioritize]~~ Add GO-CAM stats & downloads to the pipeline Sep 22, 2021

dustine32 mentioned this issue Jan 21, 2022

Add JSON product production for GO-CAM API to pipeline geneontology/pipeline#265

Open

12 tasks

kltm closed this as completed Feb 23, 2022
