normalize ENVO terms #25

Open
cmungall opened this issue Sep 29, 2020 · 11 comments
@cmungall
Collaborator

These are mostly strings. Some do not correspond to a class label, e.g. 'tundra'

There should be a repair step that gets the IDs. I suggest a denormalized/flattened schema where we append _id onto the field name, e.g. env_local_scale_id=ENVO:nnnn. In the NMDC/MIxS schema this is a compound object
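A minimal sketch of the flattened layout suggested above, assuming values already in the `termLabel [termID]` form carry an `ENVO:` prefix followed by digits. The helper name `flatten_env_fields` and the regular expression are illustrative only, not part of any existing pipeline:

```python
import re

# Matches the MIxS "termLabel [termID]" form, e.g. "terrestrial biome [ENVO:00000446]".
# Assumption: IDs look like "ENVO:" followed by digits.
TERM_RE = re.compile(r"^\s*(?P<label>.+?)\s*\[(?P<id>ENVO:\d+)\]\s*$")

ENV_FIELDS = ("env_broad_scale", "env_medium", "env_local_scale")


def flatten_env_fields(row):
    """Return a copy of row with an extra <field>_id key per environment field."""
    out = dict(row)
    for field in ENV_FIELDS:
        match = TERM_RE.match(row.get(field) or "")
        # Leave the ID empty when the value is a bare string such as "Pacific Ocean".
        out[field + "_id"] = match.group("id") if match else ""
    return out


row = {
    "env_broad_scale": "terrestrial biome [ENVO:00000446]",
    "env_medium": "biological product [ENVO:02000043]",
    "env_local_scale": "Pacific Ocean",
}
flat = flatten_env_fields(row)
```

Here `flat["env_broad_scale_id"]` holds the extracted ID, while `flat["env_local_scale_id"]` stays empty because the value has no bracketed ID to extract.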

@turbomam
Collaborator

Is this a matter of normalizing ENVO terms to something (more authoritative? better structured? better coverage?)

Or is it a matter of normalizing from the NMDC/MIxS schema to ENVO?

Or from user-submitted values (intended for NMDC/MIxS) to ENVO?

@cmungall
Collaborator Author

name->ID

@cmungall
Collaborator Author

let's look at the input table

$ mlr --ocsv --itsvlite cut -f accession,package_name,env_broad_scale,env_medium,env_local_scale downloads/harmonized-table.tsv then filter 'env_broad_scale != ""'
accession,env_broad_scale,env_medium,package_name,env_local_scale
SAMN00000002,terrestrial biome [ENVO:00000446],biological product [ENVO:02000043],MIGS: cultured bacteria/archaea; version 5.0,human-associated habitat [ENVO:00009003]
SAMN00000003,terrestrial biome [ENVO:00000446],biological product [ENVO:02000043],MIGS: cultured bacteria/archaea; version 5.0,human-associated habitat [ENVO:00009003]
SAMN00000004,terrestrial biome [ENVO:00000446],biological product [ENVO:02000043],MIGS: cultured bacteria/archaea; version 5.0,human-associated habitat [ENVO:00009003]

^^ these are ok. This also conforms to our schema

  env_broad_scale:
    is_a: environment field
    aliases:
    - broad-scale environmental context
    description: "In this field, report which major environmental system your sample\
      \ or specimen came from. The systems identified should have a coarse spatial\
      \ grain, to provide the general environmental context of where the sampling\
      \ was done (e.g. were you in the desert or a rainforest?). We recommend using\
      \ subclasses of ENVO\u2019s biome class: http://purl.obolibrary.org/obo/ENVO_00000428.\
      \ Format (one term): termLabel [termID], Format (multiple terms): termLabel\
      \ [termID]|termLabel [termID]|termLabel [termID]. Example: Annotating a water\
      \ sample from the photic zone in middle of the Atlantic Ocean, consider: oceanic\
      \ epipelagic zone biome [ENVO:01000033]. Example: Annotating a sample from the\
      \ Amazon rainforest consider: tropical moist broadleaf forest biome [ENVO:01000228].\
      \ If needed, request new terms on the ENVO tracker, identified here: http://www.obofoundry.org/ontology/envo.html"
    pattern: '{termLabel} {[termID]}'
    examples:
    - value: forest biome [ENVO:01000174]
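
One way to check whether a value conforms to the format described in the schema (one term, or several terms separated by `|`) is a regular expression. This is a sketch under assumptions: the schema's `pattern` is a template rather than a regex, so the expression below is my own reading of it, and `conforms` is a hypothetical helper name:

```python
import re

# One "termLabel [termID]" unit. Assumption: the label contains no brackets
# or pipes, and the ID is an alphabetic prefix, a colon, then digits.
TERM = r"[^\[\]|]+ \[[A-Za-z]+:\d+\]"

# One term, or several terms joined by "|".
VALUE_RE = re.compile(rf"{TERM}(?:\|{TERM})*")


def conforms(value):
    """True if value is 'label [ID]' or a '|'-separated list of such terms."""
    return VALUE_RE.fullmatch(value.strip()) is not None


single = conforms("forest biome [ENVO:01000174]")
multiple = conforms(
    "oceanic epipelagic zone biome [ENVO:01000033]"
    "|tropical moist broadleaf forest biome [ENVO:01000228]"
)
bare_string = conforms("aquatic")
```

Values that fail this check are candidates for the repair step rather than errors in their own right.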

but look at others

accession,env_broad_scale,env_medium,package_name,env_local_scale
...
SAMN00001340,aquatic,saline water,"MIMS: metagenome/environmental, water; version 5.0",Pacific Ocean
SAMN00001362,aquatic,saline water,"MIMS: metagenome/environmental, water; version 5.0",Pacific Ocean

^^ the submitter gave strings not IDs. We want to fix

replace aquatic with ENVO ID for aquatic biome

replace saline water with ENVO ID for saline water

I think "pacific ocean" is just the wrong string for env_local_scale

for ones that can't be matched, just report and move on

replace each string with mixs syntax

"LABEL [ENVO:nnnn]"
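
The repair steps above can be sketched as a case-insensitive label lookup that reports unmatched strings and moves on. The function names are hypothetical and the lookup table is a tiny stand-in built from IDs that appear in this issue; a real run would presumably load labels from ENVO itself:

```python
def build_label_index(terms):
    """Map lower-cased label -> (canonical label, term ID)."""
    return {label.lower(): (label, term_id) for label, term_id in terms}


def to_mixs(value, index):
    """Return 'LABEL [ENVO:nnnn]' for a known label, or None if unmatched."""
    hit = index.get(value.strip().lower())
    if hit is None:
        return None
    label, term_id = hit
    return f"{label} [{term_id}]"


# Illustrative subset only; these label/ID pairs are quoted from this issue.
KNOWN_TERMS = [
    ("terrestrial biome", "ENVO:00000446"),
    ("forest biome", "ENVO:01000174"),
    ("human-associated habitat", "ENVO:00009003"),
]
index = build_label_index(KNOWN_TERMS)

repaired, unmatched = {}, []
for raw in ["Forest biome", "terrestrial biome", "Pacific Ocean"]:
    fixed = to_mixs(raw, index)
    if fixed is None:
        unmatched.append(raw)  # just report and move on
    else:
        repaired[raw] = fixed
```

Exact label matching would not resolve a string like "aquatic" to "aquatic biome"; that case needs synonyms or some other mapping, which this sketch deliberately leaves out.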

@turbomam
Collaborator

@hrshdhgd have you done much with this yet? @wdduncan helped me find relevant input data and utilities and I have been reading about MIxS in general. I think I could do the following now: map unique values from env_broad_scale, env_medium and env_local_scale to the "LABEL [ENVO:nnnn]" notation, as TSV output.

  • I guess interleaving those mappings back into harmonized-table.tsv shouldn't be too hard, but I haven't planned that out yet.
  • I haven't planned any quality filters yet either
  • https://www.ncbi.nlm.nih.gov/biosample/docs/attributes/ doesn't seem to suggest mapping package_name to ENVO. Does that make sense to you?
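
A rough sketch of the "unique values to TSV" step described above, using only the standard library. The sample table, the `lookup` dictionary and the output column names are assumptions for illustration; the third sample row is made up so that one value has a known mapping:

```python
import csv
import io

ENV_FIELDS = ("env_broad_scale", "env_medium", "env_local_scale")


def unique_env_values(rows):
    """Collect the distinct non-empty values seen in each environment field."""
    seen = {field: set() for field in ENV_FIELDS}
    for row in rows:
        for field in ENV_FIELDS:
            value = (row.get(field) or "").strip()
            if value:
                seen[field].add(value)
    return seen


def write_mapping_tsv(seen, lookup, handle):
    """Write field, raw value and normalized value (blank if unmapped) as TSV."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["field", "raw_value", "normalized"])
    for field in ENV_FIELDS:
        for value in sorted(seen[field]):
            writer.writerow([field, value, lookup.get(value, "")])


# Stand-in for harmonized-table.tsv; the last row is hypothetical.
sample = (
    "id\tenv_broad_scale\tenv_medium\tenv_local_scale\n"
    "BIOSAMPLE:SAMN00001340\taquatic\tsaline water\tPacific Ocean\n"
    "BIOSAMPLE:SAMN00001362\taquatic\tsaline water\tPacific Ocean\n"
    "BIOSAMPLE:EXAMPLE\tforest biome\t\t\n"
)
rows = list(csv.DictReader(io.StringIO(sample), delimiter="\t"))

# Hypothetical mapping table; in practice this would come from the repair step.
lookup = {"forest biome": "forest biome [ENVO:01000174]"}

out = io.StringIO()
write_mapping_tsv(unique_env_values(rows), lookup, out)
lines = out.getvalue().splitlines()
```

Working on unique values keeps the mapping table small and easy to review by hand before anything is written back to the full table.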

@turbomam
Collaborator

Also @cmungall and others, it seems that accession is very frequently blank. I know that it wouldn't make sense to map that, but it makes me a little uncomfortable to see so many blanks in what might be the primary key for this table

@wdduncan
Collaborator

@turbomam I am normalizing the package names in ticket #24

Also, the primary key is in id field (e.g., BIOSAMPLE:SAMN00000002).

@turbomam
Collaborator

Thanks @wdduncan

I'm curious, but this is probably not relevant to this task:
What is accession used for vs. id?

@wdduncan
Collaborator

@turbomam I'm not sure about the meaning of the accession field. It seems to be some kind of identifier that the INCA uses. But there are other ways the identifiers are captured in the biosample_set.xml; e.g., here is an xml blob from that file:

<BioSample submission_date="2008-04-04T08:44:24.950" last_update="2019-06-20T16:11:22.271" publication_date="2008-04-04T00:00:00.000" access="public" id="2" accession="SAMN00000002">
  <Ids>
    <Id db="BioSample" is_primary="1">SAMN00000002</Id>
    <Id db="WUGSC" db_label="Sample name">19655</Id>
    <Id db="SRA">SRS000002</Id>
  </Ids>
....
</BioSample>

In this case the accession has a value.
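
For reference, the identifiers in a record like this can be pulled out with the standard library's `xml.etree.ElementTree`. This is only a sketch: the blob below is a trimmed, well-formed version of the snippet above (the elided parts are simply left out), and `sample_identifiers` is a hypothetical helper name:

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the record above, reduced to the parts used here.
BLOB = """<BioSample access="public" id="2" accession="SAMN00000002">
  <Ids>
    <Id db="BioSample" is_primary="1">SAMN00000002</Id>
    <Id db="WUGSC" db_label="Sample name">19655</Id>
    <Id db="SRA">SRS000002</Id>
  </Ids>
</BioSample>"""


def sample_identifiers(xml_text):
    """Return the accession attribute, the primary Id and all Ids keyed by db."""
    root = ET.fromstring(xml_text)
    by_db = {}
    primary = None
    for el in root.iter("Id"):
        text = (el.text or "").strip()
        by_db[el.get("db")] = text
        if el.get("is_primary") == "1":
            primary = text
    return {"accession": root.get("accession"), "primary": primary, "by_db": by_db}


ids = sample_identifiers(BLOB)
```

In this record the `accession` attribute and the primary `Id` carry the same value, so either could serve as a fallback when the other is blank.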

@hrshdhgd
Collaborator

hrshdhgd commented Mar 29, 2021

@turbomam, by accession, you mean the column named accession_biosample_id, correct?

> @hrshdhgd have you done much with this yet? @wdduncan helped me find relevant input data and utilities and I have been reading about MIxS in general. I think I could do the following now: map unique values from env_broad_scale, env_medium and env_local_scale to the "LABEL [ENVO:nnnn]" notation, as TSV output.

I have not yet. I think that seems like a good plan.

> • I guess interleaving those mappings back into harmonized-table.tsv shouldn't be too hard, but I haven't planned that out yet.

I'm guessing a JOIN using id and accession_biosample_id as keys should do the trick?

> • I haven't planned any quality filters yet either

Something we'll need to discuss further

There is a field named environmental package there. That could be the mapping

@hrshdhgd
Collaborator

I also just noticed that the accession_biosample_id is just a suffix to the id column if that is of any value.
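
Since the accession appears to be the part of `id` after the prefix, the JOIN mentioned earlier could be keyed on a derived accession. A small sketch, assuming the prefix is separated by a colon as in `BIOSAMPLE:SAMN00000002`; the function names and the shape of the normalized table are hypothetical:

```python
def accession_from_id(sample_id):
    """'BIOSAMPLE:SAMN00000002' -> 'SAMN00000002'; values without ':' pass through."""
    _prefix, sep, suffix = sample_id.partition(":")
    return suffix if sep else sample_id


def join_normalized(harmonized_rows, normalized_by_accession):
    """Left join: keep every harmonized row, adding normalized columns when present."""
    joined = []
    for row in harmonized_rows:
        extra = normalized_by_accession.get(accession_from_id(row["id"]), {})
        joined.append({**row, **extra})
    return joined


harmonized = [
    {"id": "BIOSAMPLE:SAMN00000002", "env_broad_scale": "terrestrial biome [ENVO:00000446]"},
    {"id": "BIOSAMPLE:SAMN00001340", "env_broad_scale": "aquatic"},
]
# Hypothetical output of the normalization step, keyed by bare accession.
normalized = {"SAMN00000002": {"env_broad_scale_id": "ENVO:00000446"}}

joined = join_normalized(harmonized, normalized)
```

A left join keeps rows that have no normalized counterpart, which fits the "report and move on" approach for unmatched strings.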

@hrshdhgd
Collaborator

hrshdhgd commented Mar 29, 2021

I have been working on runNER some more and I have added the following features:

  1. Added a column 'SENTENCE' that shows the sentence in which the tagged token appears, to give users context.
  2. Added a suffix '_SYNONYM' for synonym terms tagged by OGER.

Question: @cmungall, when adding the MIxS syntax in the format LABEL [ENVO:nnnn], would you expect the same format for synonyms, e.g. LABEL [ENVO:nnnn_SYNONYM], or not?

hrshdhgd added a commit that referenced this issue Mar 30, 2021
Notebook for documenting steps towards normalizing ENVO terms in the 3 columns - env_broad_scale, env_medium and env_local_scale
turbomam pushed a commit that referenced this issue Jul 2, 2021
Notebook for documenting steps towards normalizing ENVO terms in the 3 columns - env_broad_scale, env_medium and env_local_scale