Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add parsing module #2

Merged
merged 2 commits into from
Jul 27, 2022
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
36 changes: 36 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,2 +1,38 @@
# pangeo-forge-esgf
Using queries to the ESGF API to generate urls and keyword arguments for receipe generation in pangeo-forge


## Parsing a list of instance ids using wildcards
Pangeo forge recipes require the user to provide exact instance_id's for the datasets they want to be processed. Discovering these with the [web search](https://esgf-node.llnl.gov/search/cmip6/) can become cumbersome, especially when dealing with a large number of members/models etc.

`pangeo-forge-esgf` provides some functions to query the ESGF API based on instance_id values with wildcards.

For example if you want to find all the zonal (`uo`) and meridonal (`vo`) velocities available for the `lgm` experiment of PMIP, you can do:

```python
from pangeo_forge_esgf.parsing import parse_instance_ids
parse_iids = [
"CMIP6.PMIP.*.*.lgm.*.*.uo.*.*",
"CMIP6.PMIP.*.*.lgm.*.*.vo.*.*",
]
iids = []
for piid in parse_iids:
iids.extend(parse_instance_ids(piid))
iids
```

and you will get:
```
['CMIP6.PMIP.MIROC.MIROC-ES2L.lgm.r1i1p1f2.Omon.uo.gn.v20191002',
'CMIP6.PMIP.AWI.AWI-ESM-1-1-LR.lgm.r1i1p1f1.Odec.uo.gn.v20200212',
'CMIP6.PMIP.AWI.AWI-ESM-1-1-LR.lgm.r1i1p1f1.Omon.uo.gn.v20200212',
'CMIP6.PMIP.MIROC.MIROC-ES2L.lgm.r1i1p1f2.Omon.uo.gr1.v20200911',
'CMIP6.PMIP.MPI-M.MPI-ESM1-2-LR.lgm.r1i1p1f1.Omon.uo.gn.v20200909',
'CMIP6.PMIP.AWI.AWI-ESM-1-1-LR.lgm.r1i1p1f1.Omon.vo.gn.v20200212',
'CMIP6.PMIP.MIROC.MIROC-ES2L.lgm.r1i1p1f2.Omon.vo.gn.v20191002',
'CMIP6.PMIP.AWI.AWI-ESM-1-1-LR.lgm.r1i1p1f1.Odec.vo.gn.v20200212',
'CMIP6.PMIP.MIROC.MIROC-ES2L.lgm.r1i1p1f2.Omon.vo.gr1.v20200911',
'CMIP6.PMIP.MPI-M.MPI-ESM1-2-LR.lgm.r1i1p1f1.Omon.vo.gn.v20190710']
```

Eventually I hope I can leverage this functionality to handle user requests in PRs that add wildcard instance_ids, but for now this might be helpful to manually construct lists of instance_ids to submit to a pangeo-forge feedstock.
45 changes: 45 additions & 0 deletions pangeo_forge_esgf/parsing.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,45 @@
import requests
import json
from .utils import facets_from_iid

def request_from_facets(url, **facets):
params = {
"type": "Dataset",
"retracted": "false",
"format": "application/solr+json",
"fields": "instance_id",
"latest": "true",
"distrib": "true",
"limit": 500,
}
params.update(facets)
return requests.get(url=url, params=params)


def instance_ids_from_request(json_dict):
iids = [item["instance_id"] for item in json_dict["response"]["docs"]]
uniqe_iids = list(set(iids))
return uniqe_iids


def parse_instance_ids(iid: str) -> list[str]:
"""Parse an instance id with wildcards"""
facets = facets_from_iid(iid)
#convert string to list if square brackets are found
for k,v in facets.items():
if '[' in v:
v = v.replace('[', '').replace(']', '').replace('\'', '').replace(' ','').split(',')
facets[k] = v
facets_filtered = {k: v for k, v in facets.items() if v != "*"}

#TODO: I should make the node url a keyword argument. For now this works well enough
url="https://esgf-node.llnl.gov/esg-search/search"
# url = "https://esgf-data.dkrz.de/esg-search/search"
# TODO: how do I iterate over this more efficiently? Maybe we do not want to allow more than x files parsed?
resp = request_from_facets(url, **facets_filtered)
if resp.status_code != 200:
print(f"Request [{resp.url}] failed with {resp.status_code}")
return resp
else:
json_dict = resp.json()
return instance_ids_from_request(json_dict)