
Cuda zero copy #13002

Closed
scottrbrtsn wants to merge 28 commits into main from cuda-zero-copy

Conversation

scottrbrtsn

Reference issue (if any)

What does this implement/fix?

Looking through the CUDA options for running signal transformations, I noticed the possibility of leveraging zero-copy methods for further speedup. These were adopted from cuSignal: the cuSignal README Quickstart illustrates an example of how to allocate shared memory. That method, get_shared_memory, did not migrate into cupy like the other methods, so I added it here for mne.

Per some example benchmarks shared in the most relevant CUDA issue from mne, this simple adjustment potentially cuts resampling time in half.

Additional information

  1. Benchmarks are hard. I'm not 100% confident this is a guaranteed performance gain. When running cold, the difference might not be noticeable, but once the GPU is warm, subsequent runs appear to be significantly different (on my machine). My initial tests with mne_shared_test.py evolved into the final version implemented here for mne.
  2. I grep'ed for all occurrences of cupy.array and was surprised there are only 3. I'm wondering if other areas of mne that still just use np.asarray would also benefit.
  3. This implementation takes the simple route: it offers cupy.asarray, which will copy if shared memory is not provided but will use shared memory if the array is already allocated on the device. This puts the burden on the caller to provide the shared memory space (as demonstrated in the tests, and sketched in the example after this list).
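For illustration, here is a minimal sketch of that zero-copy pattern, assuming cupy, numba, and a working CUDA device are available; get_shared_mem below is an illustrative stand-in modeled on the cuSignal quickstart, not the exact code added in this PR:

import numpy as np
import cupy
from numba import cuda


def get_shared_mem(shape, dtype=np.float64):
    # Page-locked host memory that is also mapped into the GPU address space,
    # as in the cuSignal README Quickstart.
    return cuda.mapped_array(shape, dtype=dtype)


x = np.random.randn(100_000)
shared = get_shared_mem(x.shape, dtype=x.dtype)
shared[:] = x                 # fill the mapped buffer on the host side
x_gpu = cupy.asarray(shared)  # per cuSignal, consumed on the device without an extra copy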


welcome bot commented Dec 3, 2024

Hello! 👋 Thanks for opening your first pull request here! ❤️ We will try to get back to you soon. 🚴

@scottrbrtsn
Author

scottrbrtsn commented Dec 3, 2024

Where is cupy added as a dependency? I didn't see it in the pyproject.toml.

@larsoner
Member

larsoner commented Dec 3, 2024

It's an optional dependency, and I don't think we've added a pip install mne[cuda] extra, though we could.

@scottrbrtsn
Author

Sorry I meant cupy.
This adds numba as a dependency. It's what is used to allocate shared memory.
Should it be added as an optional dependency?

@larsoner
Member

larsoner commented Dec 3, 2024

This adds numba as a dependency. It's what is used to allocate shared memory.

Numba is already an optional dependency. We shouldn't make it mandatory. And we should also make it so that CUDA can still be used without numba. Numba is not always easy to install...

@scottrbrtsn
Author

Got it. I saw tests failed because numba couldn't be found. I probably need to tweak something.

Following the pattern, as it stands numba is only imported when get_shared_memory is called.

@larsoner
Member

larsoner commented Dec 3, 2024

Got it. I saw tests failed because numba couldn't be found. I probably need to tweak something.

You should be able to use pytest.importorskip in the tests that need it
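For reference, the pattern might look something like this (the test name and body are placeholders rather than the PR's actual tests, and a CUDA device is assumed for the mapped allocation):

import numpy as np
import pytest


def test_shared_mem_roundtrip():  # hypothetical test name
    numba = pytest.importorskip("numba")  # skip this test if numba is missing
    cupy = pytest.importorskip("cupy")    # likewise for cupy
    x = np.arange(10.0)
    shared = numba.cuda.mapped_array(x.shape, dtype=x.dtype)
    shared[:] = x
    assert cupy.allclose(cupy.asarray(shared), cupy.asarray(x))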

@scottrbrtsn
Author

Why isn't cupy skipped with pytest.importorskip?

@scottrbrtsn
Author

And then I don't see where other tests skip numba. Sorry, I haven't skipped things like this so just trying to understand the context.

@larsoner
Member

larsoner commented Dec 3, 2024

Because if you do n_jobs="cuda" but don't have cupy installed it is the same as just doing n_jobs=1. So no need to skip those tests, they just run (uninterestingly, redundantly) as if n_jobs=1 had been passed.

@scottrbrtsn
Author

aha...ok got it.

If these tests are skipped, do they get run somewhere else?

@larsoner
Member

larsoner commented Dec 3, 2024

Oh, actually no tests are skipped for numba because those also run just fine if numba is not installed (it just won't use numba for the computations)!

I think you're actually in a similar situation here where the code should run regardless of whether or not numba and/or cupy are installed. But if both are installed, it should use shared memory when possible. So no need for pytest.importorskip...

To actually know whether or not the shared-memory paths are used, we check the coverage, or (better) add a logger.debug statement in the shared-mem codepath and then capture the logging output: if both modules are installed, make sure the shared-mem logger.debug line is emitted; if one or neither module is installed, assert that the line is not in the log message. See, for example, how we capture and then check the log messages to make sure the correct number of intervals is used in this preprocessing method:

with catch_logging() as log:
    want_noisy, want_flat = find_bad_channels_maxwell(
        raw.copy().crop(n / raw.info["sfreq"], None), min_count=1, verbose="debug"
    )
log = log.getvalue()
assert "in 2 intervals " in log

@scottrbrtsn
Author

scottrbrtsn commented Dec 4, 2024

Ok, I added a gate so that we don't get shared memory if CUDA is not enabled.

The logging is a bit less straightforward, so I'm thinking about how to do that.

I found this; it seems a bit more relevant to what I need to do?

out = log_file.getvalue().split("\n")[:-1]
# triage based on whether or not we actually expected to use CUDA
from mne.cuda import _cuda_capable # allow above funs to set it
tot = 12 if _cuda_capable else 0
assert sum(["Using CUDA for FFT FIR filtering" in o for o in out]) == tot
if not _cuda_capable:
    pytest.skip("CUDA not enabled")

@scottrbrtsn
Author

Side thought: logging in mne/cuda is all at the INFO level. Should those be at the debug level?

@scottrbrtsn
Author

scottrbrtsn commented Dec 4, 2024

I opted to gate get_shared_mem with has_numba from mne.fixes. This follows the pattern of if _cuda_capable...

This way the test continues, and the user isn't forced to install either one (that's the goal 😄🤞).

@scottrbrtsn
Author

I see the error for doc/python_reference.rst, but I'm confused because I cannot seem to find that file. I assumed Sphinx auto-created things.

I'm not sure where to put the function reference for the documentation.

@larsoner
Member

larsoner commented Dec 4, 2024

Side thought: logging in mne/cuda is all at the INFO level. Should those be at the debug level?

I haven't looked but I think the ones in there were chosen on purpose because they might be useful to know a bit about what is going on.

This way the test continues, and the user isn't forced to install either one (that's the goal 😄🤞).

Yes based on what you're saying this sounds reasonable. I'll look at the code in a bit but hopefully can give you a quick pointer about python_reference.rst (looking for that part now)

@@ -0,0 +1 @@
Short description of the changes, by :newcontrib:`Scott Robertson`.
Member


Just adding a comment so we don't forget to actually update this 🙂

Member


... and adding your name to doc/changes/names.inc will fix the CircleCI error:

[towncrier-fragments]:89: ERROR: Indirect hyperlink target "new contributor Scott Robertson"  refers to target "scott robertson", which does not exist. [docutils]

mne/cuda.py Outdated
@@ -19,6 +20,53 @@
_cuda_capable = False


def get_shared_mem(
Member

@larsoner larsoner Dec 4, 2024


Is there a reason this needs to be public? The fewer things we need in our public API the better. Then all of this shared mem stuff can just happen automagically. So I'm inclined to say this should be _get_shared_mem.

And really we should only add more options when they're needed, so I'm not sure we need all the strides etc. options yet.

And to simplify things and make them more DRY, I'd be tempted to call this:

def _share_cuda_mem(x, n_jobs):
    from mne.fixes import has_numba  # so it can be monkey-patched in tests

    if n_jobs == "cuda" and _cuda_capable and has_numba:
        from numba import cuda
        out = cuda.mapped_array(x.shape, ...)
        out[:] = x
    else:
        out = x
    return out

Our CIs won't complain about a new public function not being documented, and some of the code below gets simpler and more DRY because you can just do x = _share_cuda_mem(x, n_jobs) (rather than repeat the same conditional in three places).
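As a simplified illustration of that last point (reusing the _share_cuda_mem sketch above; the signature below is abbreviated, not the actual mne/cuda.py code):

import cupy


def _cuda_upload_rfft(x, n, n_jobs="cuda"):  # abbreviated signature
    x = _share_cuda_mem(x, n_jobs)  # no-op unless CUDA, cupy, and numba are usable
    return cupy.fft.rfft(cupy.asarray(x), n)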

Author


ok. I hear you on making this more DRY. I'm not sure I'm all the way there yet.

_share_cuda_mem implies only calls from within mne.cuda. This is fine.

However, I think that when _cuda_upload_rfft is called by fft_resample, n_jobs is not available: n_jobs has been returned as 1 (i.e., 1 CPU job, to run in parallel with CUDA on the GPU).

And then the cleanest, DRYest option would be for mne.filter to call _share_cuda_mem (making it no longer a private call...).

Or I'm missing the best place for _share_cuda_mem to be called. I'll keep thinking on it. Still spinning up on the logic flow, I feel slow, lol.

Further, regarding _share_cuda_mem(x, n_jobs): could this just be _share_cuda_mem(x), because it's private and always called from within cuda, so "cuda" is assumed? Or do what I did, and pass in "cuda" when we're already at the point of knowing?

Member

@larsoner larsoner Dec 4, 2024


However, I think that when _cuda_upload_rfft is called by fft_resample, n_jobs is not available: n_jobs has been returned as 1 (i.e., 1 CPU job, to run in parallel with CUDA on the GPU).

If it's inside a call that we know is in the cuda-only path then you would just call _share_cuda_mem(x, 'cuda')

And then the cleanest, DRYest option would be for mne.filter to call _share_cuda_mem (making it no longer a private call...).

This would still be private. Private vs. public doesn't refer to existence in mne.filter vs. mne.cuda; it refers to the leading underscore. mne.filter._share_cuda_mem and mne.cuda._share_cuda_mem are both private in the sense that they can be used inside our codebase however we need, but we can move and change them without any API deprecation period. (Users should never use a private attribute / method / function, i.e., one with a leading _ in the name or in a so-named namespace.)

Author


copy. I'm a recovering java -> python dev that gets confused about straddling the paradigms.

that's a great explanation. when you get a sec, take a look at how it looks now.

The only further improvement I see is to call _share_cuda_mem before passing W and x into mne.cuda from mne.filter. Thoughts?

@larsoner
Member

larsoner commented Dec 5, 2024

Hah, I forgot that I took out my CUDA compute GPU because I wasn't using it 😆 But @scottrbrtsn I was just going to start by testing with something really simple like the following:

$ python -m timeit -s "import mne; raw = mne.io.read_raw_fif(mne.datasets.sample.data_path() / 'MEG' / 'sample' / 'sample_audvis_raw.fif').load_data()" -n 5 "raw.copy().resample(100, n_jobs='cuda')"
5 loops, best of 5: 1.74 sec per loop

and do it first on main and then on this branch. Can you try and see if it works for you? FYI calling mne.datasets.sample.data_path() will download and extract ~2GB to ~/mne_data/MNE-sample-data if you haven't done it already.

@scottrbrtsn
Author

wilco. maybe not today. my schedule is smashed.

I think I ran the tests, which pulled mne_data. I have data in that folder.

@larsoner
Member

larsoner commented Dec 5, 2024

I think I ran the tests, which pulled mne_data. I have data in that folder.

That would be mne.datasets.testing (rather than mne.datasets.sample) which has some other files in it. A very similar test using those files you already have would be:

$ python -m timeit -s "import mne; raw = mne.io.read_raw_fif(mne.datasets.testing.data_path() / 'MEG' / 'sample' / 'sample_audvis_trunc_raw.fif').load_data(); raw = mne.concatenate_raws([raw.copy() for _ in range(20)])" -n 5 "raw.copy().resample(100, n_jobs='cuda')"

5 loops, best of 5: 1.26 sec per loop

@scottrbrtsn
Author

scottrbrtsn commented Dec 6, 2024

dang, I'm wondering if the arrays in those tests are too big?

320 events found on stim channel STI 014
Event IDs: [ 1  2  3  4  5 32]
CUDA not used, could not instantiate memory (arrays may be too large), falling back to n_jobs=None

@larsoner
Member

larsoner commented Dec 6, 2024

Could be, feel free to try smaller ones if you need to. But really I'm a bit surprised because it should resample one channel at a time IIRC, and that should be < 100 MB somewhere

@larsoner
Member

larsoner commented Dec 6, 2024

... just change the range(20) in my example above to something smaller if you want

@scottrbrtsn
Author

yea this doesn't add up. My RAM doesn't spike.

When I first started testing larger arrays, my RAM would spike to >30 GB before I got GPU OOM errors.

I also tried er_noise...

Now using CUDA device 0
Enabling CUDA with 15.24 GiB available memory
CUDA not used, could not instantiate memory (arrays may be too large), falling back to n_jobs=None
CUDA not used, could not instantiate memory (arrays may be too large), falling back to n_jobs=None

@scottrbrtsn
Author

oh... I broke something, that's why. 🙃

@larsoner
Member

larsoner commented Dec 6, 2024

We should probably add _explain_exception output to the logger message about not being able to instantiate memory. That probably would have saved some confusion!
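A rough sketch of that idea, with the message text mirroring the log output quoted above and _explain_exception assumed to be importable from mne.utils (this is not the actual mne/cuda.py code):

import numpy as np
from numba import cuda
from mne.utils import logger, _explain_exception

x = np.zeros(1000)
try:
    out = cuda.mapped_array(x.shape, dtype=x.dtype)
    out[:] = x
except Exception:
    logger.info(
        "CUDA not used, could not instantiate memory (arrays may be too "
        "large), falling back to n_jobs=None:\n" + _explain_exception()
    )
    out = x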

@scottrbrtsn
Author

It looks like the tests have been swallowing an error all along. Not sure how long it's been here.

@scottrbrtsn
Author

Yep, something like that would help. I had to print out the exception: Cannot interpret 'Ellipsis' as a data type.
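That error reads as if the Ellipsis placeholder from the earlier sketch was passed where a NumPy dtype was expected; forwarding the input's dtype would avoid it (an inference from the error text, not the actual commit):

import numpy as np
from numba import cuda

x = np.zeros(1000)
out = cuda.mapped_array(x.shape, dtype=x.dtype)  # rather than dtype=...
out[:] = x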

scottrbrtsn and others added 2 commits December 6, 2024 12:23
…etting shared mem is necessary, for filtering...need to think on this tho
@scottrbrtsn
Author

oh my... ok yea.

the dreaded copy/paste while multitasking error. sorry.

This uncovered a different error.

When ifft gets called by the tests, I *think* 🧠 CUDA memory is already allocated; therefore the dtype is different and not compatible with _share_cuda_mem.

My last change removed the _share_cuda_mem call. Testing in progress.
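One way to make the helper tolerate buffers that are already on the device would be to map only plain NumPy host arrays and pass everything else through untouched; an illustrative sketch, not the PR's final code:

import numpy as np


def _share_cuda_mem(x, n_jobs):
    from mne.cuda import _cuda_capable
    from mne.fixes import has_numba

    share = (
        n_jobs == "cuda"
        and _cuda_capable
        and has_numba
        and isinstance(x, np.ndarray)  # cupy/device arrays are passed through
    )
    if share:
        from numba import cuda

        out = cuda.mapped_array(x.shape, dtype=x.dtype)
        out[:] = x
    else:
        out = x
    return out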

@scottrbrtsn
Author

dang.
I wish I had known to run this test at the beginning.

Seems to not be any better. 😞

@scottrbrtsn
Author

I'll keep looking later. Maybe the size/type of the random arrays I was originally using led to false gains.

@scottrbrtsn
Author

yep.

I went back to my original test. After aligning your test and mine as closely as possible against mne's main branch, using real mne data rather than random signal data, the speedup does not hold anymore.

sorry to distract you, I was led astray by the data I used to test.

@scottrbrtsn scottrbrtsn closed this Dec 6, 2024
@scottrbrtsn scottrbrtsn deleted the cuda-zero-copy branch December 6, 2024 20:13