Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improving performance for h5netcdf #195

Open
hmaarrfk opened this issue Nov 30, 2022 · 13 comments
Open

Improving performance for h5netcdf #195

hmaarrfk opened this issue Nov 30, 2022 · 13 comments

Comments

@hmaarrfk
Copy link
Contributor

What happened:

I've been trying to use h5netcdf since it gives access to the underlying h5py dataset which makes it easier to prototype with different HDF5 backends. netcdf4-c is much harder to compile and thus prototype with since many of the internals are not accessible.

My original concern arose from observations that, for some of our datasets, they open with

  • xarray + netcdf4-python in about 80 ms
  • xarray + h5netcdf: 600 ms.

Ultimately, I believe that I think things boil down to the fact an access to the _h5ds property is quite costly. This makes "every" operation "slow".

the code below:

import h5netcdf
from pathlib import Path

h5nc_file = h5netcdf.File(
    Path.home() / 'sample_file_zebrafish_0.18.55_4uCams_small.nc',
)

images = h5nc_file.variables['images']
for i in range(1000):
    images.shape

Using the spyder profile, one finds that 1000 calls to size, result in about 8000 calls to _h5ds
image

import h5netcdf
from pathlib import Path

h5nc_file = h5netcdf.File(
    Path.home() / 'sample_file_zebrafish_0.18.55_4uCams_small.nc',
)

images = h5nc_file.variables['images']

for i in range(8000):
    images._h5ds

This is because this property is created on demand.

I think this has cascading effects because everything really relies on the access to the underlying h5py structure.

A previous suggestion tried to remove the property, but it seems like there was some concern about "resetting" the underlying _h5py pointer. I am unclear on where this kind of usecase would come up. My understanding is that this pointer would be something unchanging.

Expected Output

As a comparison, we can compare how long it takes to access the shape property from netcdf4:

import h5netcdf
from pathlib import Path
import netCDF4

h5nc_file = netCDF4.Dataset(
    str(Path.home() / 'sample_file_zebrafish_0.18.55_4uCams_small.nc'),
)

images = h5nc_file.variables['images']
for i in range(1000):
    images.shape

About 400 us instead of 350 ms.
image

Anything else we need to know?:

The sample image has 4 dimensions

sample_file_zebrafish_0.18.55_4uCams_small.zip

Version

Output of print(h5py.version.info, f"\nh5netcdf {h5netcdf.__version__}")

Summary of the h5py configuration

h5py 3.7.0
HDF5 1.12.2
Python 3.9.13 | packaged by ...
[GCC 10.4.0]
sys.platform linux
sys.maxsize 9223372036854775807
numpy 1.23.5
cython (built with) 0.29.32
numpy (built against) 1.20.3
HDF5 (built against) 1.12.2

h5netcdf 0.14.0.dev51+g8302776

@hmaarrfk
Copy link
Contributor Author

As an example, the following patch, avoids the creation of the _h5ds on demand, (it keeps a had reference is).

diff --git a/h5netcdf/core.py b/h5netcdf/core.py
index 39b0b5a..cb7f2b1 100644
--- a/h5netcdf/core.py
+++ b/h5netcdf/core.py
@@ -114,6 +114,10 @@ class BaseVariable(object):
         self._dimensions = dimensions
         self._initialized = True
 
+        # Always refer to the root file and store not h5py object
+        # subclasses:
+        self._h5ds = self._root._h5file[self._h5path]
+
     @property
     def _parent(self):
         return self._parent_ref()
@@ -122,12 +126,6 @@ class BaseVariable(object):
     def _root(self):
         return self._root_ref()
 
-    @property
-    def _h5ds(self):
-        # Always refer to the root file and store not h5py object
-        # subclasses:
-        return self._root._h5file[self._h5path]
-
     @property
     def name(self):
         """Return variable name."""
diff --git a/h5netcdf/dimensions.py b/h5netcdf/dimensions.py
index d291731..d5bda4d 100644
--- a/h5netcdf/dimensions.py
+++ b/h5netcdf/dimensions.py
@@ -86,8 +86,13 @@ class Dimension(object):
         else:
             self._root._max_dim_id += 1
         self._dimensionid = self._root._max_dim_id
-        if parent._root._writable and create_h5ds and not self._phony:
+        if self._phony:
+            self._h5ds = None
+        elif parent._root._writable and create_h5ds:
+            # Create scale will create the self._h5ds object
             self._create_scale()
+        else:
+            self._h5ds = self._root._h5file[self._h5path]
         self._initialized = True
 
     @property
@@ -131,12 +136,6 @@ class Dimension(object):
             return False
         return self._h5ds.maxshape == (None,)
 
-    @property
-    def _h5ds(self):
-        if self._phony:
-            return None
-        return self._root._h5file[self._h5path]
-
     @property
     def _isscale(self):
         return h5py.h5ds.is_scale(self._h5ds.id)
@@ -183,6 +182,7 @@ class Dimension(object):
                 dtype=">f4",
                 **kwargs,
             )
+        self._h5ds = self._root._h5file[self._h5path]
         self._h5ds.attrs["_Netcdf4Dimid"] = np.array(self._dimid, dtype=np.int32)
 
         if len(self._h5ds.shape) > 1:

but brings down from 500 ms (on my computer) to 22 ms
image

@hmaarrfk
Copy link
Contributor Author

Relevant discussion: #117

@kmuehlbauer
Copy link
Collaborator

@hmaarrfk Thanks for providing all that information and your engagement to make h5netcdf more performant. I've taken a first step to minimize re-reading from underlying h5py/h5df in #196. The changes are minor, but have a huge effect on the presented use-case.

I've tried your patch here, but it broke the test suite. I'll look deeper into this the next days.

@hmaarrfk
Copy link
Contributor Author

hmaarrfk commented Dec 1, 2022

Thank you! My patch was more a proof of concept that h5netcdf doesn't have to be slow. I was trying to get pympler installed to recreate the cyclic dependency graphs before commenting further.

I also think that we (personally) need to tune our file structure. I remember reading about HDF5 metadata needing to be tuned for "large" files.

Were you able to recreate similar performance trends with a newly created hand crafted file? We likely generated the original netcdf4 with netcdf4-c (primarily due to performance concerns with h5netcdf).

@kmuehlbauer
Copy link
Collaborator

@hmaarrfk Do you think it would be useful to create several datasets from scratch with netCDF4-python/h5netcdf? We could build a benchmark suite with your and others specific use cases from that, also including examples from the past. At least this would give us some comparison between different approaches.

@hmaarrfk
Copy link
Contributor Author

hmaarrfk commented Dec 3, 2022

We could build a benchmark suite

Building benchmarking infrastructure is non-trivial. I'm happy to help creating the datasets, and "donating" some "real world" dataset.

@hmaarrfk
Copy link
Contributor Author

hmaarrfk commented Dec 3, 2022

For this particular case, I found that it isn't dependent on my own nc file

import h5netcdf
from pathlib import Path
from time import perf_counter, sleep

import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    filename = Path(tmpdir) / 'h5netcdf_test.nc'

    h5nc_file = h5netcdf.File(filename, 'w')
    h5nc_file.dimensions = {'x': 1024}
    x = h5nc_file.dimensions['x']

    time_start = perf_counter()
    for i in range(1000):
        len(x)
    time_end = perf_counter()
    time_elapsed = time_end - time_start
    print(f"{time_elapsed:03.3f} --- Time to compute length 1000 times: ")
        
    time_start = perf_counter()
    for i in range(1000):
        x._h5ds
    time_end = perf_counter()
    time_elapsed = time_end - time_start
    print(f"{time_elapsed:03.3f} --- Time to access h5ds reference 1000 times")
    
    time_start = perf_counter()
    for i in range(1000):
        x._root._h5file[x._h5path]
    time_end = perf_counter()
    time_elapsed = time_end - time_start
    print(f"{time_elapsed:03.3f} --- Time to create h5ds reference 1000 times")
        
    h5nc_file.close()
    
# %% on main
# 0.053 --- Time to compute length 1000 times: 
# 0.046 --- Time to access h5ds reference 1000 times
# 0.046 --- Time to create h5ds reference 1000 times

# %% On https://github.com/h5netcdf/h5netcdf/pull/197
# 0.003 --- Time to compute length 1000 times: 
# 0.000 --- Time to access h5ds reference 1000 times
# 0.034 --- Time to create h5ds reference 1000 times

@veenstrajelmer
Copy link

veenstrajelmer commented Aug 21, 2024

I think I might be encountering the same issue when using xarray with engine="h5netcdf" (default is engine="netcdf4"). The memory usage is way lower with h5netcdf but reading the files is way slower. When profiling this I see that more than half the time is spent on name() in base.py, which seems not to be a complex function:
https://github.com/h5py/h5py/blob/f88df800a209d743d5dc1c8b30f0c1adc758ccf1/h5py/_hl/base.py#L287-L289

def name(self):
    """ Return the full name of this object.  None if anonymous. """
    return self._d(h5i.get_name(self.id))

However, the netcdf dataset I am testing with has >2410 variables and the call to name() is done >7300 times, which accounts for 29.8 seconds of the total 49.5 seconds for xr.open_mfdataset(). This seems to me a bit much. could it be that if you unfold your profiler it also ends up with a significant time spent on calls to name() or is it something related or completely different. I also found the h5py issue h5py/h5py#2395 to be potentially relevant.

@hmaarrfk
Copy link
Contributor Author

I'm not sure if i cross referenced it here, but ultimately some of my speedups came in conflict with some file caching I couldn't figure out.

I tried to document it pydata/xarray#7359

@veenstrajelmer
Copy link

veenstrajelmer commented Aug 30, 2024

@hmaarrfk it seems that your xarray question is not being answered and the PR from @kmuehlbauer would be a useful fix but was also not merged. What is the best way to go forward? Is it realistic that either of these will be picked up further on the short term?

@hmaarrfk
Copy link
Contributor Author

I'm honestly not sure. I've tried to understand the xarray caching file manager. It doesn't seem super intuitive to edit.

I feel like for many of my use cases, we've someewhat learned to live with it.

Sometimes it helps to learn that other people care. I open alot of issues on OSS projects, so sometimes its easy to think that "Mark is just opening them for awareness".

For example, I still owe this metadata:
pydata/xarray#9003 (comment)

@hmaarrfk
Copy link
Contributor Author

Sometimes it helps to learn that other people care. I open alot of issues on OSS projects, so sometimes its easy to think that "Mark is just opening them for awareness".

I encourage you to comment on the issues I linked to.

@veenstrajelmer
Copy link

@hmaarrfk I cannot judge the last issue you linked, as for much of the other linked PR's and issues. I would gladly comment on issues if that seems helpful (I did on the one you linked just yet), but not sure where and who. I hope the open PR's/issues can be consolidated a bit and that the separate efforts can be combined.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants