
Cuda zero copy #13002

Closed
scottrbrtsn wants to merge 28 commits into main from cuda-zero-copy

Conversation

scottrbrtsn

Reference issue (if any)

What does this implement/fix?

Looking through the CUDA options for running signal transformations, I noticed the possibility of leveraging zero-copy methods for further speedup. These were adopted from cuSignal: the cuSignal README Quickstart illustrates an example of how to allocate shared memory. That method, get_shared_memory, did not migrate into cupy like the other methods, so I added it here for mne.

Per some example benchmarks shared in the most relevant CUDA issue from mne, this simple adjustment potentially cuts resampling time in half.

Additional information

  1. Benchmarks are hard. I'm not 100% confident this is a guaranteed performance gain. When running cold, the difference might not be noticeable, but once the GPU is warm, subsequent runs appear to be significantly different (on my machine). My initial tests with mne_shared_test.py evolved into the final version implemented here for mne.
  2. I grep'ed for all occurrences of cupy.array and was surprised there are only 3. I'm wondering if other areas of mne that still just use np.asarray would also benefit.
  3. This implementation takes the simple route: it offers cupy.asarray, which will copy if shared memory is not provided but will use shared memory if the array is already allocated on the device. This puts the burden on the caller to provide the shared memory space (as demonstrated in the tests, and sketched in the example after this list).
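For illustration, here is a minimal sketch of that zero-copy pattern, assuming cupy, numba, and a working CUDA device are available; get_shared_mem below is an illustrative stand-in modeled on the cuSignal quickstart, not the exact code added in this PR:

import numpy as np
import cupy
from numba import cuda


def get_shared_mem(shape, dtype=np.float64):
    # Page-locked host memory that is also mapped into the GPU address space,
    # as in the cuSignal README Quickstart.
    return cuda.mapped_array(shape, dtype=dtype)


x = np.random.randn(100_000)
shared = get_shared_mem(x.shape, dtype=x.dtype)
shared[:] = x                 # fill the mapped buffer on the host side
x_gpu = cupy.asarray(shared)  # per cuSignal, consumed on the device without an extra copy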


welcome bot commented Dec 3, 2024

Hello! 👋 Thanks for opening your first pull request here! ❤️ We will try to get back to you soon. 🚴

@scottrbrtsn
Author

scottrbrtsn commented Dec 3, 2024

Where is cupy added as a dependency? I didn't see it in the pyproject.toml.

@larsoner
Member

larsoner commented Dec 3, 2024

It's an optional dependency, and I don't think we've added a pip install mne[cuda] extra, though we could.

@scottrbrtsn
Author

Sorry I meant cupy.
This adds numba as a dependency. It's what is used to allocate shared memory.
Should it be added as an optional dependency?

@larsoner
Member

larsoner commented Dec 3, 2024

This adds numba as a dependency. It's what is used to allocate shared memory.

Numba is already an optional dependency. We shouldn't make it mandatory. And we should also make it so that CUDA can still be used without numba. Numba is not always easy to install...

@scottrbrtsn
Author

Got it. I saw tests failed because numba couldn't be found. I probably need to tweak something.

Following the pattern, as it stands numba is only imported when get_shared_memory is called.

@larsoner
Member

larsoner commented Dec 3, 2024

Got it. I saw tests failed because numba couldn't be found. I probably need to tweak something.

You should be able to use pytest.importorskip in the tests that need it
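For reference, the pattern might look something like this (the test name and body are placeholders rather than the PR's actual tests, and a CUDA device is assumed for the mapped allocation):

import numpy as np
import pytest


def test_shared_mem_roundtrip():  # hypothetical test name
    numba = pytest.importorskip("numba")  # skip this test if numba is missing
    cupy = pytest.importorskip("cupy")    # likewise for cupy
    x = np.arange(10.0)
    shared = numba.cuda.mapped_array(x.shape, dtype=x.dtype)
    shared[:] = x
    assert cupy.allclose(cupy.asarray(shared), cupy.asarray(x))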

@scottrbrtsn
Author

Why isn't cupy skipped with pytest.importorskip?

@scottrbrtsn
Author

And then I don't see where other tests skip numba. Sorry, I haven't skipped things like this so just trying to understand the context.

@larsoner
Member

larsoner commented Dec 3, 2024

Because if you do n_jobs="cuda" but don't have cupy installed it is the same as just doing n_jobs=1. So no need to skip those tests, they just run (uninterestingly, redundantly) as if n_jobs=1 had been passed.

@scottrbrtsn
Author

aha...ok got it.

If these tests are skipped, do they get run somewhere else?

@larsoner
Member

larsoner commented Dec 3, 2024

Oh, actually no tests are skipped for numba because those also run just fine if numba is not installed (it just won't use numba for the computations)!

I think you're actually in a similar situation here where the code should run regardless of whether or not numba and/or cupy are installed. But if both are installed, it should use shared memory when possible. So no need for pytest.importorskip...

To actually know whether or not the shared-memory paths are used, we check the coverage, or (better) add a logger.debug statement in the shared-mem codepath and then capture the logging output: if both modules are installed, make sure the shared-mem logger.debug line is emitted; if one or neither module is installed, assert that the line is not in the log message. See, for example, how we capture and then check the log messages to make sure the correct number of intervals is used in this preprocessing method:

with catch_logging() as log:
    want_noisy, want_flat = find_bad_channels_maxwell(
        raw.copy().crop(n / raw.info["sfreq"], None), min_count=1, verbose="debug"
    )
log = log.getvalue()
assert "in 2 intervals " in log

@scottrbrtsn
Author

scottrbrtsn commented Dec 4, 2024

Ok, I added a gate so that we don't get shared memory if CUDA is not enabled.

The logging is a bit less straightforward, so I'm thinking about how to do that.

I found this; it seems a bit more relevant to what I need to do?

out = log_file.getvalue().split("\n")[:-1]
# triage based on whether or not we actually expected to use CUDA
from mne.cuda import _cuda_capable # allow above funs to set it
tot = 12 if _cuda_capable else 0
assert sum(["Using CUDA for FFT FIR filtering" in o for o in out]) == tot
if not _cuda_capable:
    pytest.skip("CUDA not enabled")

@scottrbrtsn
Author

Side thought: logging in mne/cuda is all at the INFO level. Should those be at the debug level?

@scottrbrtsn
Author

scottrbrtsn commented Dec 4, 2024

I opted to gate get_shared_mem with has_numba from mne.fixes. This follows the pattern of if _cuda_capable...

This way the test continues, and the user isn't forced to install either one (that's the goal 😄🤞).

@scottrbrtsn
Author

I see the error for doc/python_reference.rst, but I'm confused because I cannot seem to find that file. I assumed Sphinx auto-created things.

I'm not sure where to put the function reference for the documentation.

@larsoner
Member

larsoner commented Dec 4, 2024

Side thought: logging in mne/cuda is all at the INFO level. Should those be at the debug level?

I haven't looked but I think the ones in there were chosen on purpose because they might be useful to know a bit about what is going on.

This way the test continues, and the user isn't forced to install either one (that's the goal 😄🤞).

Yes based on what you're saying this sounds reasonable. I'll look at the code in a bit but hopefully can give you a quick pointer about python_reference.rst (looking for that part now)

@@ -0,0 +1 @@
Short description of the changes, by :newcontrib:`Scott Robertson`.
Member


Just adding a comment so we don't forget to actually update this 🙂

Member


... and adding your name to doc/changes/names.inc will fix the CircleCI error:

[towncrier-fragments]:89: ERROR: Indirect hyperlink target "new contributor Scott Robertson"  refers to target "scott robertson", which does not exist. [docutils]

mne/cuda.py Outdated
@@ -19,6 +20,53 @@
_cuda_capable = False


def get_shared_mem(
Member

@larsoner larsoner Dec 4, 2024


Is there a reason this needs to be public? The fewer things we need in our public API the better. Then all of this shared mem stuff can just happen automagically. So I'm inclined to say this should be _get_shared_mem.

And really we should only add more options when they're needed, so I'm not sure we need all the strides etc. options yet.

And to simplify things and make them more DRY, I'd be tempted to call this:

def _share_cuda_mem(x, n_jobs):
    from mne.fixes import has_numba  # so it can be monkey-patched in tests

    if n_jobs == "cuda" and _cuda_capable and has_numba:
        from numba import cuda
        out = cuda.mapped_array(x.shape, ...)
        out[:] = x
    else:
        out = x
    return out

Our CIs won't complain about a new public function not being documented, and some of the code below gets simpler and more DRY because you can just do x = _share_cuda_mem(x, n_jobs) (rather than repeat the same conditional in three places).
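As a simplified illustration of that last point (reusing the _share_cuda_mem sketch above; the signature below is abbreviated, not the actual mne/cuda.py code):

import cupy


def _cuda_upload_rfft(x, n, n_jobs="cuda"):  # abbreviated signature
    x = _share_cuda_mem(x, n_jobs)  # no-op unless CUDA, cupy, and numba are usable
    return cupy.fft.rfft(cupy.asarray(x), n)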

Author


ok. I hear you on making this more DRY. I'm not sure I'm all the way there yet.

_share_cuda_mem implies only calls from within mne.cuda. This is fine.

However, I think that when _cuda_upload_rfft is called by fft_resample, n_jobs is not available: n_jobs has been returned as 1 (i.e., 1 CPU job, to run in parallel with CUDA on the GPU).

And then the cleanest, DRYest option would be for mne.filter to call _share_cuda_mem (making it no longer a private call...).

Or I'm missing the best place for _share_cuda_mem to be called. I'll keep thinking on it. Still spinning up on the logic flow, I feel slow, lol.

Further, regarding _share_cuda_mem(x, n_jobs): could this just be _share_cuda_mem(x), because it's private and always called from within cuda, so "cuda" is assumed? Or do what I did, and pass in "cuda" when we're already at the point of knowing?

Member

@larsoner larsoner Dec 4, 2024


However, I think that when _cuda_upload_rfft is called by fft_resample, n_jobs is not available: n_jobs has been returned as 1 (i.e., 1 CPU job, to run in parallel with CUDA on the GPU).

If it's inside a call that we know is in the cuda-only path then you would just call _share_cuda_mem(x, 'cuda')

And then the cleanest, DRYest option would be for mne.filter to call _share_cuda_mem (making it no longer a private call...).

This would still be private. Private vs. public doesn't refer to existence in mne.filter vs. mne.cuda; it refers to the leading underscore. mne.filter._share_cuda_mem and mne.cuda._share_cuda_mem are both private in the sense that they can be used inside our codebase however we need, but we can move and change them without any API deprecation period. (Users should never use a private attribute / method / function, i.e., one with a leading _ in the name or in a so-named namespace.)

Author


copy. I'm a recovering java -> python dev that gets confused about straddling the paradigms.

that's a great explanation. when you get a sec, take a look at how it looks now.

The only further improvement I see is to call _share_cuda_mem before passing W and x into mne.cuda from mne.filter. Thoughts?

@larsoner
Member

larsoner commented Dec 5, 2024

Hah, I forgot that I took out my CUDA compute GPU because I wasn't using it 😆 But @scottrbrtsn I was just going to start by testing with something really simple like the following:

$ python -m timeit -s "import mne; raw = mne.io.read_raw_fif(mne.datasets.sample.data_path() / 'MEG' / 'sample' / 'sample_audvis_raw.fif').load_data()" -n 5 "raw.copy().resample(100, n_jobs='cuda')"
5 loops, best of 5: 1.74 sec per loop

and do it first on main and then on this branch. Can you try and see if it works for you? FYI calling mne.datasets.sample.data_path() will download and extract ~2GB to ~/mne_data/MNE-sample-data if you haven't done it already.

@scottrbrtsn
Author

wilco. maybe not today. my schedule is smashed.

I think I ran the tests, which pulled mne_data. I have data in that folder.

@larsoner
Member

larsoner commented Dec 5, 2024

I think I ran the tests, which pulled mne_data. I have data in that folder.

That would be mne.datasets.testing (rather than mne.datasets.sample) which has some other files in it. A very similar test using those files you already have would be:

$ python -m timeit -s "import mne; raw = mne.io.read_raw_fif(mne.datasets.testing.data_path() / 'MEG' / 'sample' / 'sample_audvis_trunc_raw.fif').load_data(); raw = mne.concatenate_raws([raw.copy() for _ in range(20)])" -n 5 "raw.copy().resample(100, n_jobs='cuda')"

5 loops, best of 5: 1.26 sec per loop

@scottrbrtsn
Author

scottrbrtsn commented Dec 6, 2024

dang, I'm wondering if the arrays in those tests are too big?

320 events found on stim channel STI 014
Event IDs: [ 1  2  3  4  5 32]
CUDA not used, could not instantiate memory (arrays may be too large), falling back to n_jobs=None

@larsoner
Member

larsoner commented Dec 6, 2024

Could be, feel free to try smaller ones if you need to. But really I'm a bit surprised because it should resample one channel at a time IIRC, and that should be < 100 MB somewhere

@larsoner
Member

larsoner commented Dec 6, 2024

... just change the range(20) in my example above to something smaller if you want

@scottrbrtsn
Author

yea this doesn't add up. My RAM doesn't spike.

When I first started testing larger arrays, my RAM would spike to >30 GB before I got GPU OOM errors.

I also tried er_noise...

Now using CUDA device 0
Enabling CUDA with 15.24 GiB available memory
CUDA not used, could not instantiate memory (arrays may be too large), falling back to n_jobs=None
CUDA not used, could not instantiate memory (arrays may be too large), falling back to n_jobs=None

@scottrbrtsn
Author

oh... I broke something, that's why. 🙃

@larsoner
Member

larsoner commented Dec 6, 2024

We should probably add _explain_exception output to the logger message about not being able to instantiate memory. That probably would have saved some confusion!
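A rough sketch of that idea, with the message text mirroring the log output quoted above and _explain_exception assumed to be importable from mne.utils (this is not the actual mne/cuda.py code):

import numpy as np
from numba import cuda
from mne.utils import logger, _explain_exception

x = np.zeros(1000)
try:
    out = cuda.mapped_array(x.shape, dtype=x.dtype)
    out[:] = x
except Exception:
    logger.info(
        "CUDA not used, could not instantiate memory (arrays may be too "
        "large), falling back to n_jobs=None:\n" + _explain_exception()
    )
    out = x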

@scottrbrtsn
Author

It looks like the tests have been swallowing an error all along. Not sure how long it's been here.

@scottrbrtsn
Author

Yep, something like that would help. I had to print out the exception: Cannot interpret 'Ellipsis' as a data type.
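That error reads as if the Ellipsis placeholder from the earlier sketch was passed where a NumPy dtype was expected; forwarding the input's dtype would avoid it (an inference from the error text, not the actual commit):

import numpy as np
from numba import cuda

x = np.zeros(1000)
out = cuda.mapped_array(x.shape, dtype=x.dtype)  # rather than dtype=...
out[:] = x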

scottrbrtsn and others added 2 commits December 6, 2024 12:23
…etting shared mem is necessary, for filtering...need to think on this tho
@scottrbrtsn
Author

oh my... ok yea.

the dreaded copy/paste while multitasking error. sorry.

This uncovered a different error.

When ifft gets called by the tests, I *think* 🧠 CUDA memory is already allocated; therefore the dtype is different and not compatible with _share_cuda_mem.

My last change removed the _share_cuda_mem call. Testing in progress.
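One way to make the helper tolerate buffers that are already on the device would be to map only plain NumPy host arrays and pass everything else through untouched; an illustrative sketch, not the PR's final code:

import numpy as np


def _share_cuda_mem(x, n_jobs):
    from mne.cuda import _cuda_capable
    from mne.fixes import has_numba

    share = (
        n_jobs == "cuda"
        and _cuda_capable
        and has_numba
        and isinstance(x, np.ndarray)  # cupy/device arrays are passed through
    )
    if share:
        from numba import cuda

        out = cuda.mapped_array(x.shape, dtype=x.dtype)
        out[:] = x
    else:
        out = x
    return out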

@scottrbrtsn
Author

dang.
I wish I had known to run this test at the beginning.

Seems to not be any better. 😞

@scottrbrtsn
Author

I'll keep looking later. Maybe the size/type of the random arrays I was originally using led to false gains.

@scottrbrtsn
Author

yep.

I went back to my original test. After aligning your test and mine as closely as possible against mne's main branch, using real mne data rather than random signal data, the speedup does not hold anymore.

sorry to distract you, I was led astray by the data I used to test.

@scottrbrtsn scottrbrtsn closed this Dec 6, 2024
@scottrbrtsn scottrbrtsn deleted the cuda-zero-copy branch December 6, 2024 20:13