EBI-GOA to generate single GAF & GPAD & GPI files for each species #1521

Closed
pgaudet opened this issue Jun 24, 2020 · 7 comments

@pgaudet
Contributor

pgaudet commented Jun 24, 2020

Hello,

Right now we have 4 files for several species (human, chicken, cow, pig, dog), which was requested in #156 (with little explanation as to why we wanted that).

  1. annotations to canonical accessions from the UniProt reference proteome
  2. annotations to isoforms from the UniProt reference proteome
  3. annotations to complexes
  4. annotations to RNAs
  • @alexsign will generate a single file for those species.

  • we will also need to update the GOA yaml file

Thanks, Pascale

@kltm
Member

kltm commented Jul 24, 2023

In discussion with @pgaudet, we'll try switching over to this after the next successful release, once @pgaudet is back online.

@cmungall
Member

To clarify: this only applies to GAFs where there does not exist a MOD that assigns gene IDs.

For a MOD like dictyBase, even though the GAFs are currently managed at EBI-GOA, we just want a traditional GAF. The quad distinction (isoform, complex, etc.) is only relevant when UniProt IDs are the primary IDs.

This is the current dicty file, which is what we want:

curl -L -s ftp://ftp.ebi.ac.uk/pub/contrib/goa/dictyBase.gaf.gz | gzip -dc | grep -c ^dictyBase
79736
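
The same kind of count can be sketched in Python for cases where a per-database breakdown is more useful than a single number. This is a minimal sketch, not the pipeline's own code: the function name `count_by_db` and the sample rows are made up for illustration, and it groups on the exact value of GAF column 1 (DB), whereas `grep -c ^dictyBase` matches any line that merely starts with that string.

```python
import gzip
import io
from collections import Counter


def count_by_db(lines):
    """Count annotation rows per value of GAF column 1 (DB).

    Header/comment lines (starting with '!') and blank lines are skipped.
    """
    counts = Counter()
    for line in lines:
        if line.startswith("!") or not line.strip():
            continue
        counts[line.split("\t", 1)[0]] += 1
    return counts


# Made-up, truncated rows used only to exercise the function;
# a real GAF has 17 tab-separated columns.
SAMPLE_GAF = (
    "!gaf-version: 2.2\n"
    "dictyBase\tDDB_G0000001\tgeneA\n"
    "dictyBase\tDDB_G0000002\tgeneB\n"
    "UniProtKB\tP00000\tprotC\n"
)

# Compress in memory so the read path mirrors a downloaded .gaf.gz file.
compressed = gzip.compress(SAMPLE_GAF.encode("utf-8"))
with gzip.open(io.BytesIO(compressed), "rt", encoding="utf-8") as handle:
    counts = count_by_db(handle)

print(counts["dictyBase"])
```

To run it against a real file, replace the in-memory buffer with the path of a locally downloaded `.gaf.gz`.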

The GAFs that are here are uniprot ID-based:
https://ftp.ebi.ac.uk/pub/contrib/goa/grcp_plus_test/

We shouldn't consume or promote the MOD ones (dicty, zfin, etc.) as this will lead to confusion.

Of course we still want the consolidated ones for human, cow, etc.

@cmungall
Member

Updating this issue to clarify what the actual requirements are for human, etc.

  • for protein-coding genes, GCRP only (this is important), i.e. not "2" in the list above
  • for RNAs, RNAcentral IDs
  • for complexes, Complex Portal IDs

The so-called isoform file should not be included, as it confuses things and messes up counts for enrichment, etc.
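
A minimal sketch of the selection rule described above, under stated assumptions: rows are GAF 2.x lines with 17 tab-separated columns, the DB prefixes in column 1 are `UniProtKB`, `RNAcentral` and `ComplexPortal`, and any UniProtKB row with a populated column 17 (Gene Product Form ID) is treated as an isoform-level annotation. The helper names and the accessions are hypothetical, and membership in the reference proteome is not checked here.

```python
def classify_row(line):
    """Return 'protein', 'isoform', 'rna', 'complex' or 'other' for one GAF row."""
    cols = line.rstrip("\n").split("\t")
    db = cols[0]
    # Column 17 (index 16) is the Gene Product Form ID; it may be empty or absent.
    form_id = cols[16] if len(cols) > 16 else ""
    if db == "RNAcentral":
        return "rna"
    if db == "ComplexPortal":
        return "complex"
    if db == "UniProtKB":
        # Assumption: a populated form ID marks an isoform-level annotation.
        return "isoform" if form_id else "protein"
    return "other"


def make_row(db, object_id, form_id=""):
    """Build a 17-column row with only the columns used here filled in."""
    cols = [""] * 17
    cols[0], cols[1], cols[16] = db, object_id, form_id
    return "\t".join(cols) + "\n"


# Made-up identifiers, for illustration only.
rows = [
    make_row("UniProtKB", "P00000"),
    make_row("UniProtKB", "P00000", "UniProtKB:P00000-2"),
    make_row("RNAcentral", "URS0000000000_9606"),
    make_row("ComplexPortal", "CPX-0000"),
]

# Keep canonical proteins, RNAs and complexes; drop isoform-level rows.
WANTED = {"protein", "rna", "complex"}
kept = [row for row in rows if classify_row(row) in WANTED]
print(len(kept), "of", len(rows), "rows kept")
```

Treating every populated column 17 as an isoform is a simplification; a stricter filter would inspect the form ID itself.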

This is in fact the contents of the files at https://ftp.ebi.ac.uk/pub/contrib/goa/grcp_plus_test/, as indicated in the header:

!The set of protein accessions included in this file is based on UniProt reference proteomes, which provide one protein per gene.
!They include the protein sequences annotated in Swiss-Prot or the longest TrEMBL transcript if there is no Swiss-Prot record.
!In addition this file included Swiss-Prot Isoforms, RNA and ComplexPortal annotations data

This is all good, but it is important to state the requirements / spec precisely in the issue.

But my previous point still stands: for dicty we want gene IDs.

@cmungall cmungall changed the title from "GOA to generate since GAF & GPAD & GPI files for each species" to "EBI-GOA to generate since GAF & GPAD & GPI files for each species" Jul 26, 2023
@pgaudet
Contributor Author

pgaudet commented Jul 26, 2023

@cmungall

Why don't we want annotations to isoforms? We are missing a lot of annotations because we exclude these.

@pgaudet
Contributor Author

pgaudet commented Jul 10, 2024

Clarification: there are two ways people use 'isoforms': curators mean splice variants, and UniProt means any entry with small differences. The files now generated by GOA only contain splice-variant-type isoforms.

@pgaudet
Contributor Author

pgaudet commented Jul 10, 2024

This is replaced by #2341 (comment)

@pgaudet pgaudet closed this as not planned Jul 10, 2024
@deustp01

there are two ways people use 'isoforms'; curators mean splice variants, and Uniprot means any entry with small differences.

And for what it's worth, as a historical note, Bill Pearson strongly preferred the UniProt broad sense: any variant polypeptide encoded by a gene as a result of alternative splicing or alternative transcriptional start sites, but not, I think, variants due to any form of post-translational modification, including peptide bond cleavage.
