implementing a rolling mean function #214
Replies: 9 comments
-
Hey @ahmadtourei, having rolling window functions in DASCore is a great idea! There are a few problems with using pandas, though; I think an approach based on bottleneck's moving functions is probably more optimal. If I recall correctly, pandas and xarray use bottleneck under the hood anyway. Bottleneck handles numpy arrays directly and allows arbitrary dimensions (a short sketch is at the end of this comment). I don't know yet what the interface should be, though. Two possibilities come to mind:

```python
import dascore as dc
from dascore.units import s

patch = dc.get_example_patch()

# a rolling intermediate object
rolling_mean = patch.rolling(time=10*s).mean()

# or just separate methods
rolling_mean = patch.rolling_mean(time=10*s)
```

Which do you like better? Is there another option we should consider? Another reference is xarray's rolling method.
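For illustration, a minimal sketch of calling bottleneck directly on a multi-dimensional numpy array; the array shape and window length here are arbitrary assumptions:

```python
import bottleneck
import numpy as np

data = np.random.rand(1_000, 50)  # e.g. (time, distance)

# moving mean over 100 samples along axis 0; the first window - 1 outputs
# along that axis are NaN because the window is not yet full
smoothed = bottleneck.move_mean(data, window=100, axis=0)
```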
-
@d-chambers I agree that it'd be nice to use a function that works directly with numpy arrays; bottleneck is a great one. Another possibility is using the convolve function:

```python
import dascore as dc
import numpy as np


def rolling_mean(arr, window_size):
    # Define the kernel for the running mean average
    kernel = np.ones(window_size) / window_size
    # Apply the running mean average using numpy.convolve()
    result = np.convolve(arr, kernel, mode='valid')
    return result


patch = dc.get_example_patch()

# the values below assume a real DAS patch (1 kHz sampling, >3000 channels),
# not the small example patch
d_t = 10.0
sampling_rate = 1000
window_size = int(d_t * sampling_rate)  # kernel length must be an integer
channel = 3000
result_rm = rolling_mean(patch.data[:, channel], window_size)
```

However, this is not optimal for multichannel datasets like DAS because numpy's convolve only operates on 1-D arrays, so it has to be applied channel by channel. I like the …
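To make the multichannel point concrete, a sketch of mapping the rolling_mean function above over channels with np.apply_along_axis (synthetic array sizes assumed); the Python-level loop over channels is the slow part:

```python
import numpy as np

data = np.random.rand(60_000, 100)  # synthetic (time, channel) array
window = 1_000

# np.apply_along_axis calls rolling_mean once per channel in a Python loop
out = np.apply_along_axis(rolling_mean, 0, data, window)
print(out.shape)  # (59001, 100) with mode='valid'
```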
-
How about implementing this?

```python
import numpy as np


def calculate_mean_over_samples(data, window_size=1000, step_size=1000, axis=0):
    """Compute windowed means along the first axis with the given step."""
    total_samples = data.shape[0]
    mean_values = np.empty((int(total_samples / step_size), data.shape[1]))
    for j, k in enumerate(range(0, total_samples, step_size)):
        mean_values[j, :] = np.mean(data[k:k + window_size, :], axis=axis)
    return mean_values


# sub_sp, dt, sampling_rate, and file_name are defined elsewhere in the script
for i, patch in enumerate(sub_sp):
    window_size = int(dt * sampling_rate)
    step_size = window_size
    rolling_mean = calculate_mean_over_samples(
        patch.data, window_size=window_size, step_size=step_size
    )
    new_attrs = dict(patch.attrs)
    samples = np.array(patch.coords["time"])[::step_size]
    new_coords = {x: patch.coords[x] for x in patch.dims}
    new_coords["time"] = samples
    new_attrs["d_time"] = dt
    new_attrs["min_time"] = np.min(samples)
    new_attrs["max_time"] = np.max(samples)
    new_patch = patch.new(
        data=rolling_mean, attrs=new_attrs, dims=patch.dims, coords=new_coords
    )
    # save the result in the output folder
    new_patch.io.write(file_name, "dasdae")
```

(A loop-free variant of this block averaging is sketched at the end of this comment.) If you agree, please provide some instructions so I can start implementing this in the master branch. Is the interpolate function a good example to follow (but calling my function instead)?
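Since the loop above uses a step_size equal to window_size, the same block averages could also be computed without a Python loop over windows; a minimal sketch, equivalent only up to how a partial final window is handled:

```python
import numpy as np


def block_mean(data, window_size):
    """Mean over non-overlapping windows along the first axis."""
    n_blocks = data.shape[0] // window_size
    trimmed = data[: n_blocks * window_size]
    return trimmed.reshape(n_blocks, window_size, -1).mean(axis=1)


data = np.random.rand(10_000, 100)
print(block_mean(data, 1_000).shape)  # (10, 100)
```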
-
Here are some timing results (in sec.) comparing the approaches discussed above. Please note that all timing results are for processing 1 hour of data (3,600,000 time samples) and 1,000 channels. If it helps, I can also provide timing results comparing the bottleneck-based approach.
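For reference, a rough sketch of how such a comparison can be timed; the array below is assumed much smaller than the 1-hour, 1,000-channel case so it runs quickly:

```python
import time

import bottleneck
import numpy as np
import pandas as pd

data = np.random.rand(100_000, 100)  # far smaller than 1 hour x 1000 channels
window = 10_000                      # 10 s at a 1 kHz sampling rate

t0 = time.perf_counter()
bn_out = bottleneck.move_mean(data, window=window, axis=0)
print(f"bottleneck.move_mean: {time.perf_counter() - t0:.3f} s")

t0 = time.perf_counter()
pd_out = pd.DataFrame(data).rolling(window).mean().to_numpy()
print(f"pandas df.rolling:    {time.perf_counter() - t0:.3f} s")
```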
-
Looks promising. Yes, I definitely agree we need rolling functions. Unfortunately, we can't really use df.rolling because we want to support more than just 2-D patches, but it is a good reference.

That would be great. I am curious about manually striding the data after move_mean to get the right step. For example:

```python
import bottleneck


def move_mean(array, window, step=1, axis=-1):
    """Moving mean function with a step, using bottleneck."""
    out = bottleneck.move_mean(array, window, axis=axis)
    # keep every step-th sample along the same axis the mean was applied to
    slicer = [slice(None)] * out.ndim
    slicer[axis] = slice(None, None, step)
    return out[tuple(slicer)]
```

It might seem a bit wasteful, but it might be faster (maybe?).

Ok, so let's come up with a plan forward. Here is what I am thinking: I am leaning towards copying a subset of pandas' rolling interface. I also love what pandas has done by allowing (optional) numba acceleration. That way users can efficiently apply their own jit'ed rolling window functions (a short sketch of that pandas pattern is at the end of this comment).

To start, let's create a new branch called "rolling". In it, we can create a new module in dascore/proc called rolling.py. It will also have a test file (tests/test_proc/test_rolling.py). The first thing to make is the rolling class. Something like this:

```python
from dascore.utils.patch import get_dim_value_from_kwargs


class PatchRoller:
    """A class to apply rolling operations to patches."""

    def __init__(self, patch, *, step=None, center=False, **kwargs):
        ...

    def mean(self):
        # your function here
        ...


def rolling(patch, *, step=None, center=False, **kwargs):
    """
    nice docs
    """
    dim, axis, value = get_dim_value_from_kwargs(patch, kwargs)
    # probably other things I am missing
    return PatchRoller(patch, step=step, center=center, **kwargs)
```

We can also work on other useful rolling functions such as median, max, min, etc. Some of the more complex bits will be supporting units for the kwargs (which specify the dimension name along which to apply the rolling operation) and step, as well as implementing …

Are you going to be back on campus next week? If we can set aside a few hours to work on this together (in person), it might be more effective, but if you want to get a start on it, feel free!
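For reference, a minimal sketch of the pandas pattern mentioned above, where a user-supplied function is jit-compiled by numba and applied over rolling windows (assumes numba is installed; this is only an illustration, not the proposed DASCore API):

```python
import numpy as np
import pandas as pd


def peak_to_peak(window_values):
    """A user-defined reducer applied to each rolling window."""
    return window_values.max() - window_values.min()


df = pd.DataFrame(np.random.rand(100_000, 4))
# raw=True passes plain numpy arrays to the function; engine="numba" jit-compiles it
result = df.rolling(window=100).apply(peak_to_peak, raw=True, engine="numba")
```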
-
Thanks for your thoughts!
The problem would be that it calculates the windowed mean at every sample (not fast) and only then resamples based on the step.

I agree! Yes, I will be back. My calendar is updated, so feel free to schedule a meeting whenever works for you.
-
That's true, but bottleneck is not performing the same numpy operations as …
-
I'm providing some timing results (in sec.) comparing bottleneck with the calculate_mean_over_samples function and pandas' df.rolling function.

It doesn't look very efficient, even for small windows and steps.
-
Closed by #238.
-
Based on my preliminary investigations, pandas' rolling function provides decent results compared to true low-pass filtering. Therefore, I'm thinking of implementing a rolling mean function in DASCore. Basically, it would call a pandas rolling function to process a patch of data:
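A minimal sketch of that idea, assuming the patch's time dimension is the first axis and an arbitrary window length:

```python
import dascore as dc
import pandas as pd

patch = dc.get_example_patch()
window_size = 10  # window length in samples, arbitrary for illustration

# rolling is applied down the rows, assumed here to be the time axis
df = pd.DataFrame(patch.data)
rolling_mean = df.rolling(window=window_size).mean().to_numpy()
```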
@d-chambers Would you mind adding instructions on how we can implement this? That way I'd also be able to document how to add a function like this to DASCore (like what we already have for IO).