Cuda zero copy #13002
Conversation
Hello! 👋 Thanks for opening your first pull request here! ❤️ We will try to get back to you soon. 🚴
Force-pushed from 4ad40d4 to 8235d0c.
Where is …?
It's an optional dependency and I don't think we've added a …
Sorry, I meant cupy.
Numba is already an optional dependency. We shouldn't make it mandatory. And we should also make it so that CUDA can still be used without numba. Numba is not always easy to install...
Got it. I saw tests failed bc numba couldn't be found. I must need to tweak something probably. Following the pattern, as it stands numba is only imported when …
You should be able to use …
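(Side note for readers: a minimal sketch of the kind of guarded import being discussed here, so the CUDA path still works when numba is absent. The helper names `_get_numba_cuda` and `_maybe_shared` are hypothetical, not MNE API.)

```python
import numpy as np


def _get_numba_cuda():
    """Return numba.cuda if numba is importable, else None (numba stays optional)."""
    try:
        from numba import cuda
    except Exception:
        return None
    return cuda


def _maybe_shared(x):
    """Copy x into mapped (zero-copy) memory when numba is available, else pass it through."""
    cuda = _get_numba_cuda()
    if cuda is None:  # no numba: the plain (copying) CUDA path still works
        return np.asarray(x)
    out = cuda.mapped_array(x.shape, dtype=x.dtype)
    out[:] = x
    return out
```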
Force-pushed from 4d4ffbe to c86bde3.
Why isn't …
And then I don't see where other tests skip …
Because if you do …
Aha... ok, got it. If these tests are skipped, do they get run somewhere else?
Oh, actually no tests are skipped for … I think you're actually in a similar situation here where the code should run regardless of whether or not … To actually know whether or not the shared memory paths are used we check the coverage, or (better) use some … (see mne-python/mne/preprocessing/tests/test_maxwell.py, lines 1711 to 1716 at a1a05ae).
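(For readers: a generic sketch of that monkey-patching style of check using pytest's monkeypatch fixture. The `_share_cuda_mem` name comes up later in this thread, and the whole test body is illustrative, not MNE's actual test.)

```python
import numpy as np


def test_cuda_resample_uses_shared_mem(monkeypatch):
    """Fail if the shared-memory helper is never hit on the CUDA path."""
    import mne.cuda
    from mne.filter import resample

    calls = []
    orig = mne.cuda._share_cuda_mem  # hypothetical helper name, see discussion below

    def spy(x, n_jobs):
        calls.append(n_jobs)
        return orig(x, n_jobs)

    monkeypatch.setattr(mne.cuda, "_share_cuda_mem", spy)
    resample(np.random.randn(2, 1000), up=2.0, down=1.0, n_jobs="cuda")
    assert calls, "the shared-memory path was never taken"
```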
Ok, I added a gate, to not get shared memory if cuda is not enabled. The logging is a bit less straightforward, so I'm thinking about how to do that. I found this, which seems a bit more relevant to what I need to do: mne-python/mne/tests/test_filter.py, lines 831 to 838 at a1a05ae.
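(The referenced test_filter.py pattern checks the log output. A rough sketch of that style of check follows; the asserted message is an assumption, not the PR's actual log text.)

```python
import numpy as np
from mne.filter import resample
from mne.utils import catch_logging

x = np.random.randn(4, 10_000)
with catch_logging() as log:
    resample(x, up=2.0, down=1.0, n_jobs="cuda", verbose=True)
# If the CUDA (and, for this PR, shared-memory) path really ran, the log should
# say so; the exact string to assert on depends on what the code logs.
assert "CUDA" in log.getvalue()
```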
Side thought: logging in …
I opted to gate … This way the test continues, and the user isn't forced to install either one (that's the goal 😄🤞).
I see the error for … I'm not sure where to put the function ref for the documentation.
I haven't looked but I think the ones in there were chosen on purpose because they might be useful to know a bit about what is going on.
Yes, based on what you're saying this sounds reasonable. I'll look at the code in a bit but hopefully can give you a quick pointer about …
@@ -0,0 +1 @@
Short description of the changes, by :newcontrib:`Scott Robertson`. |
Just adding a comment so we don't forget to actually update this 🙂
... and adding your name to doc/changes/names.inc will fix the CircleCI error:

    [towncrier-fragments]:89: ERROR: Indirect hyperlink target "new contributor Scott Robertson" refers to target "scott robertson", which does not exist. [docutils]
mne/cuda.py (outdated)

    @@ -19,6 +20,53 @@
    _cuda_capable = False


    def get_shared_mem(
Is there a reason this needs to be public? The fewer things we need in our public API the better. Then all of this shared mem stuff can just happen automagically. So I'm inclined to say this should be `_get_shared_mem`.

And really we should only add more options when they're needed, so I'm not sure we need all the `strides` etc. options yet.
And to simplify things and make them more DRY, I'd be tempted to call this:

    def _share_cuda_mem(x, n_jobs):
        from mne.fixes import has_numba  # so it can be monkey-patched in tests

        if n_jobs == "cuda" and _cuda_capable and has_numba:
            from numba import cuda

            out = cuda.mapped_array(x.shape, ...)
            out[:] = x
        else:
            out = x
        return out

Our CIs won't complain about a new public function not being documented, and some of the code below gets simpler and more DRY because you can just do `x = _share_cuda_mem(x, n_jobs)` (rather than repeat the same conditional in three places).
Ok. I hear you on making this more DRY. I'm not sure I'm all the way there yet.

`_share_cuda_mem` implies only calls from within `mne.cuda`. This is fine. However, I think, when `_cuda_upload_rfft` is called by `fft_resample`, `n_jobs` is not available; `n_jobs` has been returned as 1 (i.e. 1 CPU job, to run in parallel with CUDA via the GPU). And then the cleanest, DRYest option would be for `mne.filter` to call `_share_cuda_mem` (making it no longer a private call...). Or I'm missing the best place for `_share_cuda_mem` to be called. I'll keep thinking on it. Still spinning up on the logic flow, I feel slow, lol.

Further, `_share_cuda_mem(x, n_jobs)`... could this just be `_share_cuda_mem(x)`? Because this is private and always called from within cuda, so assumed to be cuda? Or do what I did, and pass in "cuda" when we're already at a point of knowing?
> However, I think, when `_cuda_upload_rfft` is called by `fft_resample`, `n_jobs` is not available; `n_jobs` has been returned as 1 (i.e. 1 CPU job, to run in parallel with CUDA via the GPU).

If it's inside a call that we know is in the cuda-only path then you would just call `_share_cuda_mem(x, 'cuda')`.
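(A hypothetical illustration of that point; the real `_cuda_upload_rfft` in mne/cuda.py may have a different signature and body. This just shows passing the literal "cuda" once we already know we are on the CUDA-only path, assuming the `_share_cuda_mem` helper sketched above.)

```python
import cupy as cp


def _cuda_upload_rfft(x, n):  # hypothetical shape of a CUDA-only helper
    # We only get here on the CUDA path, so pass "cuda" explicitly.
    x = _share_cuda_mem(x, "cuda")
    return cp.fft.rfft(cp.asarray(x), n=n)
```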
> And then the cleanest, DRYest option would be for `mne.filter` to call `_share_cuda_mem` (making it no longer a private call...)

This would still be private. Private vs. public doesn't refer to existence in `mne.filter` vs. `mne.cuda`; it refers to the leading underscore. `mne.filter._share_cuda_mem` and `mne.cuda._share_cuda_mem` are both private in the sense that they can be used inside our codebase however we need, and we can move (EDIT: and change) them without any API deprecation period. (Users should never use a private attribute/method/function, i.e., one with a leading `_` in the name or in a so-named namespace.)
Copy. I'm a recovering Java -> Python dev who gets confused about straddling the paradigms. That's a great explanation. When you get a sec, take a look at how it looks now.

The only further improvement I see is to call `_share_cuda_mem` before passing `W` and `x` into `mne.cuda` from `mne.filter`. Thoughts?
Force-pushed from bbf0448 to b9cf174.
Hah, I forgot that I took out my CUDA compute GPU because I wasn't using it 😆 But @scottrbrtsn I was just going to start by testing with something really simple like the following:
… and do it first on …
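(The snippet referenced above did not survive the page extraction. Purely as an illustrative stand-in, not the original code, a "really simple" check might look something like this; the sizes, tolerances, and use of mne.filter.resample are assumptions.)

```python
import numpy as np
from mne.filter import resample

rng = np.random.default_rng(0)
x = rng.standard_normal((10, 600_000))  # illustrative size, not from the PR

# Requires a working CUDA setup (cupy installed, MNE_USE_CUDA enabled);
# otherwise MNE warns and falls back to the CPU path.
y_cpu = resample(x, up=1.0, down=2.0, n_jobs=1)
y_gpu = resample(x, up=1.0, down=2.0, n_jobs="cuda")
np.testing.assert_allclose(y_cpu, y_gpu, rtol=1e-5, atol=1e-7)
```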
Wilco. Maybe not today, my schedule is smashed. I think I ran the tests, which pulled mne_data. I have data in that folder.
That would be …
Dang, I'm wondering if the arrays in those tests are too big?
Force-pushed from e809d71 to 88ae8cb.
Could be, feel free to try smaller ones if you need to. But really I'm a bit surprised because it should resample one channel at a time IIRC, and that should be < 100 MB somewhere.
... just change the …
Yea, this doesn't add up. My RAM doesn't spike. When I was testing larger arrays early on, my RAM would spike to >30 GB before I got GPU OOM errors. I also tried er_noise...
Oh... I broke something, that's why. 🙃
We should probably add a …
It looks like the tests have been swallowing an error all along. Not sure how long it's been here.
Yep, something like that would help. I had to print out the exception.
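(A minimal sketch of the kind of safeguard being discussed, assuming the allocation happens in a helper like the `_share_cuda_mem` suggested above. The helper name `_try_shared_mem` and the message text are illustrative; `warn` is `mne.utils.warn`.)

```python
from mne.utils import warn


def _try_shared_mem(x, cuda):
    """Return a mapped (zero-copy) copy of ``x``; fall back loudly instead of silently.

    ``cuda`` is the ``numba.cuda`` module.
    """
    try:
        out = cuda.mapped_array(x.shape, dtype=x.dtype)
        out[:] = x
    except Exception as exc:
        warn(
            "Could not allocate CUDA mapped (zero-copy) memory, falling back to "
            f"host memory ({exc})"
        )
        out = x
    return out
```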
Oh my... ok, yea, the dreaded copy/paste-while-multitasking error. Sorry. This uncovered a different error: when ifft gets called by tests, I *think* 🧠 CUDA memory is already allocated, therefore the dtype is different and not compatible with … My last change removed the …
Dang. Seems to not be any better. 😞
I'll keep looking later. Maybe the size/type of the random arrays I was originally using led to false gains.
Yep. I went back to my original test. After aligning your test and mine as close as possible, against mne's main branch, using real mne data and not random signal data, the test does not hold anymore. Sorry to distract you, I was led astray by the data I used to test.
Reference issue (if any)

What does this implement/fix?

Looking through the `cuda` options for running signal transformations, I noticed the possibility of leveraging zero-copy methods for further speedup. These were adopted from cuSignal. The cuSignal README Quickstart illustrates an example of how to allocate shared memory. This method, `get_shared_memory`, did not migrate into `cupy` like the other methods, and so I added it here for `mne`. Per some example benchmarks shared in the most relevant cuda issue from mne, this simple adjustment potentially cuts resampling time in half.

Additional information

- … `mne_shared_test.py` to be the final version for `mne` implemented here.
- I `grep`'ed for all occurrences of `cupy.array` and am surprised there are only 3. I'm wondering if other areas of `mne` would benefit which are still just using `np.asarray`.
- … `cupy.asarray`, which will copy if shared memory is not provided, but will use shared memory if the array is already allocated on the device. This puts the burden onto the caller to provide shared memory space (as demonstrated in the tests).
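
(For readers skimming the description, a minimal sketch of the zero-copy pattern being described, adapted from the cuSignal Quickstart idea referenced above. The shapes and the rfft call are illustrative, not this PR's code.)

```python
import numpy as np
import cupy as cp
from numba import cuda

# Allocate mapped ("zero-copy") host memory that the GPU can access directly.
# cuSignal's get_shared_memory helper was essentially a thin wrapper around this.
shared = cuda.mapped_array((16_384, 64), dtype=np.float64)
shared[:] = np.random.randn(16_384, 64)  # fill it like a normal ndarray

# cupy.asarray copies a plain ndarray to the device, but (as described above)
# reuses memory that is already device-accessible, avoiding the extra transfer.
gpu_view = cp.asarray(shared)
result = cp.fft.rfft(gpu_view, axis=0)
```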