-
Notifications
You must be signed in to change notification settings - Fork 86
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Opening virtual datasets (dmr-adapter) #606
base: main
Are you sure you want to change the base?
Conversation
This PR looks good to me (we need to fix some minor formatting issues with Ruff). Maybe the only missing thing would be a notebook demonstrating how to use this feature? @ayushnag |
https://gist.github.com/ayushnag/bcf946a71122f5e7a54bc72b581bd31b Better viewing experience: https://nbviewer.org/gist/ayushnag/bcf946a71122f5e7a54bc72b581bd31b If there's more you want me to add or if any step is unclear I can update the notebook |
👈 Launch a binder notebook on this branch for commit b09c3f0 I will automatically update this comment whenever this PR is modified 👈 Launch a binder notebook on this branch for commit ed4a98b 👈 Launch a binder notebook on this branch for commit abe2c28 👈 Launch a binder notebook on this branch for commit dde9c57 👈 Launch a binder notebook on this branch for commit a9a9234 👈 Launch a binder notebook on this branch for commit 8fb0140 👈 Launch a binder notebook on this branch for commit 2c28278 👈 Launch a binder notebook on this branch for commit c3e43ac |
@ayushnag, perhaps I'm missing something here, but why have you copied code from the virtualizarr library into earthaccess? I don't see anything gained by this. We can simply use virtualizarr directly. Can you clarify? |
Yes I can clarify. At some point the dmrpp parser was going to be a part of earthaccess instead of virtualizarr. However that is not the case anymore and this PR will be updated to just call virtualizarr directly |
This PR is ready to review now. I have updated the |
I can take a look this evening, great work @ayushnag !! |
I think this PR is ready to be merged, there are many considerations to the future of virtual datasets with NASA data but this PR get us closer to a workflow where we won't have to generate metadata when there is a format already available. We need to produce a few examples and put it in the documentation, maybe using SWOT, ICESat-2 and TEMPO datasets cc @danielfromearth @DeanHenze Is there anything we are missing @TomNicholas @ayushnag? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work Ayush! thank you for all the work you put on this PR and what you did on virtualizarr
. I hope we can talk with @jgallagher59701 and Miguel Jimenez today/tomorrow about what's next. Maybe we can have people hacking on this access pattern at the Pangeo event!
earthaccess/virtualizarr.py
Outdated
open_ = _parse_dmr | ||
vdatasets = [open_(fs=fs, data_path=g.data_links(access=access)[0]) for g in granules] | ||
if preprocess is not None: | ||
vdatasets = [preprocess(ds) for ds in vdatasets] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
amazing!
# Open directly with `earthaccess.open` | ||
expected = xr.open_mfdataset(earthaccess.open(granules), concat_dim="time", combine="nested", combine_attrs="drop_conflicts") | ||
|
||
result = earthaccess.open_virtual_mfdataset(granules=granules, access="indirect", concat_dime="time", parallel=True, preprocess=None) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we have Dask in the test dependencies?
earthaccess/virtualizarr.py
Outdated
from __future__ import annotations | ||
|
||
import fsspec | ||
import xarray as xr |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
xarray is not one a core dependency, I think we need to add it as an optional dependency the same way the consolidate_metadata uses Dask (see the pyproject.yaml) and then the tests that are failing should pass!
earthaccess/virtualizarr.py
Outdated
def open_virtual_mfdataset( | ||
granules: list[earthaccess.DataGranule], | ||
group: str | None = None, | ||
access: str = "indirect", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great that you're explicitly exposing this, I think we are going to deprecate the "magic" of detecting the runtime environment in-region
vs out-of-region
pyproject.toml
Outdated
@@ -64,6 +64,9 @@ kerchunk = [ | |||
"h5netcdf", | |||
"xarray", | |||
] | |||
virtualizarr = [ | |||
"virtualizarr @ git+https://github.com/zarr-developers/VirtualiZarr.git" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this because we want to be up to date for now? are we targeting an upcoming release?
|
||
# TODO: replace with xr.testing when virtualizarr fill_val is fixed (https://github.com/zarr-developers/VirtualiZarr/issues/287) | ||
# and dmrpp deflateLevel (zlib compression level) is always present (https://github.com/OPENDAP/bes/issues/954) | ||
for var in result.variables: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this the replacement for "almost equal"?
earthaccess/virtualizarr.py
Outdated
for g in granules: | ||
vdatasets.append( | ||
open_( | ||
filepath=g.data_links(access=access)[0] + ".dmrpp", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the linter tells me that filepath
is not a thing, same with indexes
as this uses open_virtual_dataset()
from down below in line 146. Maybe we need to refactor it?
@betolink One quick comment before we merge. This function doesn't appear to show up in the API reference with the docs build. I think if the new functions are imported with .api that should fix it? |
Although then the function has to be part of |
i think you're correct, I don't think the function needs to be part of API but it definitely needs to be imported, if we want the dosc to show there. I think since we are including it also in the |
virtualizarr
version (with numpy 2.0 manifest)Add tutorial notebook [TODO in later PR]📚 Documentation preview 📚: https://earthaccess--606.org.readthedocs.build/en/606/