Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Opening virtual datasets (dmr-adapter) #606

Open
wants to merge 14 commits into
base: main
Choose a base branch
from

Conversation

ayushnag
Copy link
Collaborator

@ayushnag ayushnag commented Jun 18, 2024

  • Closes Opening virtual datasets with NASA dmrpp files #605
  • Add docs
  • Unit tests. Update current test to test specific portions of the virtual dataset
  • Check indirect dmrpp reading support
  • Use updated virtualizarr version (with numpy 2.0 manifest)
  • Update CHANGELOG.md
  • Add tutorial notebook [TODO in later PR]

📚 Documentation preview 📚: https://earthaccess--606.org.readthedocs.build/en/606/

@ayushnag ayushnag marked this pull request as draft June 18, 2024 23:33
@betolink
Copy link
Member

This PR looks good to me (we need to fix some minor formatting issues with Ruff). Maybe the only missing thing would be a notebook demonstrating how to use this feature? @ayushnag

@ayushnag
Copy link
Collaborator Author

ayushnag commented Jun 20, 2024

https://gist.github.com/ayushnag/bcf946a71122f5e7a54bc72b581bd31b

Better viewing experience: https://nbviewer.org/gist/ayushnag/bcf946a71122f5e7a54bc72b581bd31b

If there's more you want me to add or if any step is unclear I can update the notebook

@betolink betolink self-assigned this Jun 25, 2024
Copy link

github-actions bot commented Nov 13, 2024

Binder 👈 Launch a binder notebook on this branch for commit b09c3f0

I will automatically update this comment whenever this PR is modified

Binder 👈 Launch a binder notebook on this branch for commit ed4a98b

Binder 👈 Launch a binder notebook on this branch for commit abe2c28

Binder 👈 Launch a binder notebook on this branch for commit dde9c57

Binder 👈 Launch a binder notebook on this branch for commit a9a9234

Binder 👈 Launch a binder notebook on this branch for commit 8fb0140

Binder 👈 Launch a binder notebook on this branch for commit 2c28278

Binder 👈 Launch a binder notebook on this branch for commit c3e43ac

@chuckwondo
Copy link
Collaborator

@ayushnag, perhaps I'm missing something here, but why have you copied code from the virtualizarr library into earthaccess? I don't see anything gained by this. We can simply use virtualizarr directly. Can you clarify?

@ayushnag
Copy link
Collaborator Author

Yes I can clarify. At some point the dmrpp parser was going to be a part of earthaccess instead of virtualizarr. However that is not the case anymore and this PR will be updated to just call virtualizarr directly

@ayushnag ayushnag marked this pull request as ready for review November 14, 2024 19:03
@ayushnag
Copy link
Collaborator Author

This PR is ready to review now. I have updated the pyproject.toml with an added dependency but let me know if the uv.lock or other files need to modified.

@betolink
Copy link
Member

I can take a look this evening, great work @ayushnag !!

@betolink
Copy link
Member

I think this PR is ready to be merged, there are many considerations to the future of virtual datasets with NASA data but this PR get us closer to a workflow where we won't have to generate metadata when there is a format already available. We need to produce a few examples and put it in the documentation, maybe using SWOT, ICESat-2 and TEMPO datasets cc @danielfromearth @DeanHenze

Is there anything we are missing @TomNicholas @ayushnag?

@betolink betolink requested review from TomNicholas and betolink and removed request for TomNicholas December 10, 2024 17:32
Copy link
Member

@betolink betolink left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work Ayush! thank you for all the work you put on this PR and what you did on virtualizarr. I hope we can talk with @jgallagher59701 and Miguel Jimenez today/tomorrow about what's next. Maybe we can have people hacking on this access pattern at the Pangeo event!

earthaccess/virtualizarr.py Outdated Show resolved Hide resolved
open_ = _parse_dmr
vdatasets = [open_(fs=fs, data_path=g.data_links(access=access)[0]) for g in granules]
if preprocess is not None:
vdatasets = [preprocess(ds) for ds in vdatasets]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

amazing!

# Open directly with `earthaccess.open`
expected = xr.open_mfdataset(earthaccess.open(granules), concat_dim="time", combine="nested", combine_attrs="drop_conflicts")

result = earthaccess.open_virtual_mfdataset(granules=granules, access="indirect", concat_dime="time", parallel=True, preprocess=None)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we have Dask in the test dependencies?

from __future__ import annotations

import fsspec
import xarray as xr
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

xarray is not one a core dependency, I think we need to add it as an optional dependency the same way the consolidate_metadata uses Dask (see the pyproject.yaml) and then the tests that are failing should pass!

earthaccess/virtualizarr.py Outdated Show resolved Hide resolved
def open_virtual_mfdataset(
granules: list[earthaccess.DataGranule],
group: str | None = None,
access: str = "indirect",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great that you're explicitly exposing this, I think we are going to deprecate the "magic" of detecting the runtime environment in-region vs out-of-region

pyproject.toml Outdated
@@ -64,6 +64,9 @@ kerchunk = [
"h5netcdf",
"xarray",
]
virtualizarr = [
"virtualizarr @ git+https://github.com/zarr-developers/VirtualiZarr.git"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this because we want to be up to date for now? are we targeting an upcoming release?


# TODO: replace with xr.testing when virtualizarr fill_val is fixed (https://github.com/zarr-developers/VirtualiZarr/issues/287)
# and dmrpp deflateLevel (zlib compression level) is always present (https://github.com/OPENDAP/bes/issues/954)
for var in result.variables:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this the replacement for "almost equal"?

for g in granules:
vdatasets.append(
open_(
filepath=g.data_links(access=access)[0] + ".dmrpp",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the linter tells me that filepath is not a thing, same with indexes as this uses open_virtual_dataset() from down below in line 146. Maybe we need to refactor it?

@ayushnag
Copy link
Collaborator Author

@betolink One quick comment before we merge. This function doesn't appear to show up in the API reference with the docs build. I think if the new functions are imported with .api that should fix it?

@ayushnag
Copy link
Collaborator Author

Although then the function has to be part of api.py. Maybe not in that case

@betolink
Copy link
Member

@betolink One quick comment before we merge. This function doesn't appear to show up in the API reference with the docs build. I think if the new functions are imported with .api that should fix it?

i think you're correct, I don't think the function needs to be part of API but it definitely needs to be imported, if we want the dosc to show there. I think since we are including it also in the __init__.py it's already part of earthaccess., good catch! do you want to push one commit and see how the docs render?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Opening virtual datasets with NASA dmrpp files
4 participants