Add support for asynchronous memcpy #8
I'll take a stab at this.
My current plan is to add the trait … @bheisler Thoughts?
Additionally, …
Actually, spinning off …
I would split … As for …
Hmmm, I forgot how tricky async safety is. To make sure the arguments stay valid, maybe returning a promise bound to the lifetime of the passed references is the way to go?
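To make the idea concrete, here is a minimal sketch of a lifetime-bound promise. All names (`CopyPromise`, `async_copy`) are hypothetical, and the "transfer" is simulated synchronously in `wait()` rather than enqueued on a CUDA stream:

```rust
// Hypothetical sketch: a promise that borrows both buffers for its
// whole lifetime, so the borrow checker forbids touching them until
// `wait` consumes the promise.
struct CopyPromise<'a> {
    src: &'a [u8],
    dst: &'a mut [u8],
}

impl<'a> CopyPromise<'a> {
    /// Block until the transfer completes, releasing the borrows.
    /// A real implementation would call cuStreamSynchronize here; we
    /// just perform the copy eagerly to illustrate the lifetimes.
    fn wait(self) {
        self.dst.copy_from_slice(self.src);
    }
}

/// Start a (simulated) asynchronous copy. The returned promise holds
/// borrows of both buffers.
fn async_copy<'a>(src: &'a [u8], dst: &'a mut [u8]) -> CopyPromise<'a> {
    assert_eq!(src.len(), dst.len());
    CopyPromise { src, dst }
}
```

With this shape, `let p = async_copy(&src, &mut dst);` leaves `dst` mutably borrowed, so e.g. `dst[0] = 9;` before `p.wait()` fails to compile. (As the discussion below notes, this alone is not sound: `mem::forget(p)` is safe and releases the borrows early.)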
Yeah, this will be tricky alright. I haven't planned out a design for this. The only time we can be sure that it's safe to drop either the host-side or the device-side half of the transfer is after a …
I've been thinking about using the Futures API to handle asynchronous stuff safely (though I'm still fuzzy on the details), so it might be necessary to hold off on this until we figure that out some more.
My current thought is something similar to this code. This would also require bookkeeping in the buffers themselves to panic if the promise is dropped and the buffers are then used. Alternatively, we could wait longer for async/await, the futures book, and all the other async goodies, and then go for that implementation, but I think that would require the same panic bookkeeping.
Unfortunately you can't do this. Forgetting a value is safe in Rust, so you could forget the promise while the buffers are still borrowed.
In rsmpi we solve this using a scope. We also currently have an outstanding PR (that I still need to finish 😊) that allows you to attach a buffer wholesale to a request (i.e. …).
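A rough sketch of the scope idea (modeled loosely on rsmpi's request scopes; every name here is hypothetical). Copies started inside the closure are recorded on the scope, and the scope waits for all of them before returning, so even if an individual handle is `mem::forget`-ten, the buffers remain borrowed by the scope until everything completes. The CUDA stream is simulated: the recorded copies are performed when the scope ends.

```rust
use std::cell::RefCell;

/// Hypothetical scope that tracks in-flight (simulated) transfers as
/// raw (src, dst, len) triples.
struct CopyScope {
    pending: RefCell<Vec<(*const u8, *mut u8, usize)>>,
}

impl CopyScope {
    /// Enqueue a copy; the `'s` borrows tie both buffers to the scope,
    /// so they cannot be touched again until the scope finishes.
    fn async_copy<'s>(&'s self, src: &'s [u8], dst: &'s mut [u8]) {
        assert_eq!(src.len(), dst.len());
        self.pending
            .borrow_mut()
            .push((src.as_ptr(), dst.as_mut_ptr(), src.len()));
    }
}

/// Run `f` with a scope; all copies enqueued inside it are completed
/// ("the stream is synchronized") before this function returns.
fn copy_scope<R>(f: impl FnOnce(&CopyScope) -> R) -> R {
    let scope = CopyScope { pending: RefCell::new(Vec::new()) };
    let result = f(&scope);
    // Stand-in for cuStreamSynchronize: drain the pending transfers.
    for (src, dst, len) in scope.pending.into_inner() {
        unsafe { std::ptr::copy_nonoverlapping(src, dst, len) };
    }
    result
}
```

The key property: forgetting has no escape hatch, because the waiting happens in `copy_scope` itself, not in a destructor the user can skip. (A production version would need more care around pointer provenance than this sketch takes.)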
Yeah, I didn't really explain the ideas for bookkeeping around what happens if the promise is dropped. My bad on that. Anyways, this scope approach looks very promising!
Yeah, that fits really well with how I was planning to handle futures. See, it's not zero-cost to create a Future tied to a CUDA stream - you have to add a …
Then the … If we add non-futures-based async functions, that can just be a different … Now that I think about it, this would probably help solve the safety problems with Contexts as well.
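One way to picture why a stream-backed Future isn't zero-cost: completion has to be signalled back to the executor somehow, typically via a host callback registered on the stream (e.g. cuStreamAddCallback), which has its own overhead. The sketch below is entirely hypothetical; a spawned thread stands in for the CUDA stream, and the shared flag stands in for the callback's completion signal:

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Arc, Mutex};
use std::task::{Context, Poll, Waker};

/// State shared between the future and the "stream callback".
#[derive(Default)]
struct Shared {
    done: AtomicBool,
    waker: Mutex<Option<Waker>>,
}

/// Hypothetical future that resolves when the stream signals completion.
struct StreamFuture {
    shared: Arc<Shared>,
}

impl Future for StreamFuture {
    type Output = ();
    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        if self.shared.done.load(Ordering::Acquire) {
            return Poll::Ready(());
        }
        // Store the waker so the callback can wake us...
        *self.shared.waker.lock().unwrap() = Some(cx.waker().clone());
        // ...and re-check to avoid a lost wakeup if the callback fired
        // between the first check and the store.
        if self.shared.done.load(Ordering::Acquire) {
            return Poll::Ready(());
        }
        Poll::Pending
    }
}

/// Stand-in for enqueueing a transfer plus a completion callback on a
/// stream; the spawned thread plays the role of the stream.
fn enqueue_fake_transfer() -> StreamFuture {
    let shared = Arc::new(Shared::default());
    let cb = Arc::clone(&shared);
    std::thread::spawn(move || {
        // ... the transfer would run here ...
        cb.done.store(true, Ordering::Release);
        if let Some(w) = cb.waker.lock().unwrap().take() {
            w.wake();
        }
    });
    StreamFuture { shared }
}
```

The allocation, the callback registration, and the waker bookkeeping are the non-zero cost; a non-futures async API could skip all of it, which is why offering both makes sense.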
Ah, I think I understand what you're saying now, and I think that should work.
Cool, it works. Link. Will need to sprinkle in some unsafe black magic so that the data can be copied back from the mutable buffer by future async_memcpy calls.
Slight problem with that: Link. Scheduling multiple copies using the same buffer is completely safe as long as they're all on the same stream, but this implementation disallows it.
Yeah, that's what I was getting at with the second part of my comment. My current solution is to return the references wrapped such that later async_copy calls can consume them, but they can't be dereferenced by anything else.
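A small sketch of that wrapping idea (all names hypothetical; the copies are again simulated synchronously). The in-flight borrow is hidden inside an opaque token: nothing can read or write through it, but a later copy on the same stream may consume it, which is exactly the same-stream chaining case that should stay legal:

```rust
/// Opaque token holding the borrow of a buffer that is "in flight".
/// It offers no Deref, so the buffer can't be touched mid-transfer.
struct InFlight<'a> {
    buf: &'a mut [u8],
}

impl<'a> InFlight<'a> {
    /// Finish the stream of copies and get the buffer back.
    /// A real implementation would call cuStreamSynchronize here.
    fn synchronize(self) -> &'a mut [u8] {
        self.buf
    }
}

/// Start a (simulated) async copy; the destination comes back wrapped.
fn start_async_copy<'a>(src: &[u8], dst: &'a mut [u8]) -> InFlight<'a> {
    dst.copy_from_slice(src);
    InFlight { buf: dst }
}

/// A later copy on the same stream may consume a token as its source,
/// allowing the buffer to be reused without unwrapping it.
fn async_copy_back<'a>(src: InFlight<'_>, dst: &'a mut [u8]) -> InFlight<'a> {
    dst.copy_from_slice(src.buf);
    InFlight { buf: dst }
}
```

Chaining looks like `async_copy_back(start_async_copy(&host, &mut dev), &mut out)`, and only `synchronize()` ever releases a buffer back to the caller.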
I'd be very wary of unsafe black magic in this case - we could end up introducing undefined behavior while trying to hide other undefined behavior. Anyway, this is kinda what I was thinking. If you can find an ergonomic way to make it unsafe to modify the buffers while they're used in an async copy, that's great. If not, I'd be OK with just doing this even if it is slightly vulnerable to data races.
How is pinned host memory handled right now? Is that what the DeviceCopy trait indicates?
Additionally, implementing the unsafe wrapper layer is done now, save for the test … After solving that issue, next up will be trying to wrap this all safely as futures, based on our earlier discussion.
Page-locked memory is all handled by the driver. You call a certain CUDA API function to allocate and free page-locked memory. The driver tracks which memory ranges are locked and uses a fast path for copies to/from those ranges.
DeviceCopy is for structures that can safely be copied to the device (i.e. they don't manage host-side resources or contain pointers that are only valid on the host). It has nothing to do with page-locking, pinning, or anything else.
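To illustrate the distinction, here is a toy marker trait in the same spirit (the name `DeviceCopyLike` and the helper are hypothetical, not RustaCUDA's actual API). Plain-old-data types opt in; types that own host resources or contain host-only pointers (`Vec`, `String`, references) must not:

```rust
/// Marker for types whose bytes are meaningful on the device.
/// Unsafe to implement: the implementor promises the type holds no
/// host-side resources and no host-only pointers.
unsafe trait DeviceCopyLike: Copy {}

unsafe impl DeviceCopyLike for u32 {}
unsafe impl DeviceCopyLike for f32 {}

/// Plain struct of floats: safe to copy to the device byte-for-byte.
#[derive(Clone, Copy)]
struct Vec3 {
    x: f32,
    y: f32,
    z: f32,
}
unsafe impl DeviceCopyLike for Vec3 {}

/// View a device-copyable value as the raw bytes a memcpy would send.
fn to_device_bytes<T: DeviceCopyLike>(v: &T) -> &[u8] {
    unsafe {
        std::slice::from_raw_parts(v as *const T as *const u8, std::mem::size_of::<T>())
    }
}
```

Note there is no mention of where the host memory lives: a `Vec3` is device-copyable whether it sits in pageable or page-locked memory, which is exactly why the trait says nothing about pinning.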
Alright, so AsyncMemcpy requires pinned memory, but it also raises a proper runtime error if given memory that isn't page-locked, so we don't necessarily need to mark that in the wrapper.
The error I was mentioning only appears when multiple tests are run at the same time. EDIT: Never mind, it appears rarely even when run alone.
Alright, to sum up my current thoughts on this: …
Could you elaborate more on this? Why not?
Previously, I thought …
Hey, I'm really interested in this feature! (I'm porting my hobby raytracer to rustacuda) I'd be completely fine with really low-tech solutions to this problem, just to get the feature out there:
Something that I can't seem to find any documentation on is the behavior of the driver when a buffer is freed in the middle of work. The driver may already take care of the hard parts of this. (I'd be happy to write a PR for option 1, and, if you like it, a PR for option 2 given a bit of time.)
Let me finish up 1. It's pretty much done, with a PR up right now; I just need to rebase it and clean it up a bit more, but I've been slow on that because of the holidays. I'll schedule some time to finish it by tomorrow. I'll defer to you on doing 2, since I'll be busy for a while. I think you probably want something more of the form …
See #20 for the PR I'm writing.
Thanks for your interest, and thanks for trying RustaCUDA! Yeah, I'd be interested in pull requests, though rusch95 has already submitted a WIP PR to add an unsafe interface for async memcpy. We may have to iterate a few times to find a good balance of safety, ergonomics and performance for the safe interface.
This is outdated information. The documentation you're referencing is for CUDA 2.3. Modern CUDA versions can use any type of memory (both pageable and page-locked) with cuMemcpyAsync(), and the documentation makes no comment on page-locked memory anymore. In fact, I've already used pageable memory in a project before. Please refer to, e.g., the CUDA 8.0 documentation or later. It would be unfortunate if RustaCUDA were to enforce such outdated limitations via the Rust type system.
Previously, my test failures entirely vanished when I switched from pageable to page-locked memory, but sure, I'll look into it. I can see it possibly resulting from some other issue that switching to page-locked memory happened to fix.
Digging this back up now that async is mostly stabilized to note that I'll try adding in a proper async API. |
Copying memory asynchronously allows the memcpy to overlap with other work, as long as that work doesn't depend on the copied data. This is important for optimal performance, so RustaCUDA should provide access to it.