Add support for asynchronous memcpy #8

Open · bheisler opened this issue Nov 25, 2018 · 32 comments

Labels: Bigger Project (A larger project, typically involving designing a safe, Rust-y wrapper around a CUDA API concept) · New CUDA Feature (Expose a new CUDA feature through RustaCUDA)

bheisler (Owner) commented Nov 25, 2018

Copying memory asynchronously allows the memcpy to overlap with other work, as long as that work doesn't depend on the copied data. This is important for optimal performance, so RustaCUDA should provide access to it.

bheisler added the New CUDA Feature and Bigger Project labels on Nov 25, 2018
rusch95 (Contributor) commented Dec 6, 2018

I'll take a stab at this.

rusch95 (Contributor) commented Dec 7, 2018

My current plan is to add a trait AsyncCopyDestination with async_copy_from and async_copy_to methods. My other thought was to implement an async version of CopyDestination itself, but that would preclude doing both sync and async copying.
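
Roughly something like this sketch (the exact signatures, and whether these methods need to be unsafe, are exactly what's still open):

```rust
use rustacuda::error::CudaResult;
use rustacuda::stream::Stream;

// Sketch of the proposed trait, mirroring CopyDestination but taking a
// stream and returning before the transfer has necessarily completed.
pub trait AsyncCopyDestination<O: ?Sized> {
    /// Enqueue an async copy from `source` into `self` on `stream`.
    /// The caller must keep both buffers alive and untouched until the
    /// stream has been synchronized.
    unsafe fn async_copy_from(&mut self, source: &O, stream: &Stream) -> CudaResult<()>;

    /// Enqueue an async copy from `self` into `dest` on `stream`.
    unsafe fn async_copy_to(&self, dest: &mut O, stream: &Stream) -> CudaResult<()>;
}
```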

@bheisler Thoughts?

rusch95 (Contributor) commented Dec 7, 2018

Additionally, device.rs is getting rather large (~1200 lines), so I'd like your thoughts on splitting it into sync.rs for synchronous memory transfer functions and tests, async.rs for asynchronous functions and tests, and then device.rs for the rest.

rusch95 (Contributor) commented Dec 7, 2018

Actually, spinning off DeviceBox, DeviceSlice, DeviceChunks, and DeviceBuffer into their own files, if possible, would probably be cleaner.

bheisler (Owner) commented Dec 7, 2018

I would split device.rs into three files for DeviceBox, DeviceSlice/DeviceChunks and DeviceBuffer.

As for CopyDestination - I'm still not entirely sold on the current design for that trait. Having both copy_from and copy_to on the same trait seems unnecessary. Anyway, fixing that (if I do fix that) would be another issue. Yeah, an AsyncCopyDestination seems like a good way to go.

rusch95 (Contributor) commented Dec 7, 2018

Hmmm, I forgot how tricky async safety is. To make sure the arguments stay valid, maybe returning a promise bound to the lifetime of the passed references is the way to go?
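
Something along these lines, purely as an illustration of the idea (names hypothetical; as the next comment points out, borrowing alone turns out not to be enough, because the promise can be leaked):

```rust
use std::marker::PhantomData;

use rustacuda::error::CudaResult;
use rustacuda::stream::Stream;

// Hypothetical: the copy hands back a promise that borrows the buffers,
// so the borrow checker keeps them alive and un-mutated until `wait`.
pub struct CopyPromise<'buf> {
    stream: &'buf Stream,
    _borrows: PhantomData<&'buf mut ()>,
}

impl<'buf> CopyPromise<'buf> {
    /// Block until the enqueued copy has finished, releasing the borrows.
    pub fn wait(self) -> CudaResult<()> {
        self.stream.synchronize()
    }
}
```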

bheisler (Owner) commented Dec 7, 2018

Yeah, this will be tricky alright. I haven't planned out a design for this. The only time we can be sure that it's safe to drop either the host-side or the device-side half of the transfer is after a synchronize (and it has to synchronize on the same stream as well).

I've been thinking about using the Futures API to handle asynchronous stuff safely (though I'm still fuzzy on the details), so it might be necessary to hold off on this until we figure that out some more.

rusch95 (Contributor) commented Dec 8, 2018

My current thought is something similar to this code.

This would also require bookkeeping in the buffers themselves so that they panic if the promise is dropped and the buffers are then used.

Alternatively, we could wait longer for async/await, the futures book, and all the other async goodies, and then go for the implementation, but I think that would require the same panic bookkeeping.

AndrewGaspar (Contributor) commented Dec 8, 2018

Unfortunately you can't do this. Forgetting a value is safe in Rust. Therefore you could forget the promise while the buffers are still borrowed:
https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=26a7e7d4bb0a348ca05cc210e7c3962a

In rsmpi we solve this using a scope concept. See the example in the README. You can also look at the documentation. Essentially, a scope object has a shorter lifetime than all buffers used by the scope, and only a reference to the scope is given to user code, meaning it can't be forgotten. New Requests (essentially promises in MPI) are attached to a scope. When the scope lambda ends, it panics if any Requests attached to the scope were not completed.

We also currently have an outstanding PR (that I still need to finish 😊) that allows you to attach a buffer wholesale to a request (i.e. Vec<T>, Box<[T]>, etc.). That's another thing you could allow.
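
Transplanted to CUDA streams, the scope idea might look roughly like this (hypothetical names, loosely modeled on rsmpi's scope rather than anything that exists in RustaCUDA today):

```rust
use rustacuda::error::CudaResult;
use rustacuda::stream::Stream;

// User code only ever receives `&CopyScope`, so the scope itself cannot be
// leaked with mem::forget, and every transfer enqueued through it borrows
// its buffers for at least `'scope`.
pub struct CopyScope<'scope> {
    stream: &'scope Stream,
}

// Run `f` with a scope tied to `stream`, and synchronize before returning,
// so no buffer borrowed by the scope can be freed or reused while a copy
// might still be in flight.
pub fn scoped<'scope, R>(
    stream: &'scope Stream,
    f: impl FnOnce(&CopyScope<'scope>) -> CudaResult<R>,
) -> CudaResult<R> {
    let scope = CopyScope { stream };
    let result = f(&scope);
    // Synchronize even if the closure bailed out early.
    stream.synchronize()?;
    result
}
```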

rusch95 (Contributor) commented Dec 8, 2018

Yeah, I didn't really explain my ideas for bookkeeping around what happens if the promise is dropped. My bad on that.

Anyways, this scope approach looks very promising!

bheisler (Owner) commented Dec 8, 2018

Yeah, that fits really well with how I was planning to handle futures.

See, it's not zero-cost to create a Future tied to a CUDA stream - you have to add a CUevent to the stream that can be polled and probably queue up a callback to wake up the future. I don't think most users will want to poll on every asynchronous task, so I figured I'd do something like this:

let stream = ...;
// Need better names
let future = stream.submit(|executor| {
    // Submit a bunch of async work using the executor
    Ok(())
})?;

Then the submit function would return a Future that contains the closure (to keep the references alive). On the first poll (or maybe immediately? Not sure which is best) it would execute the closure to submit the work, submit the event and callback, and prepare to poll the event until the work is complete.

If we add non-futures-based async functions, that can just be a different submit function that synchronize()'s to block after calling the closure.
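
A signature-level sketch of that shape (everything here, including the names, is hypothetical):

```rust
use rustacuda::error::CudaResult;
use rustacuda::stream::Stream;

// Hypothetical executor handed to the closure; async copies and kernel
// launches would be enqueued through it onto the underlying stream.
pub struct Executor<'stream> {
    stream: &'stream Stream,
}

// Hypothetical future returned by `submit`: it owns the closure (keeping
// its captured references alive). On first poll it would run the closure,
// record a CUevent on the stream, and register a callback that wakes the
// task once the event completes.
pub struct SubmitFuture<'stream, F> {
    stream: &'stream Stream,
    work: Option<F>,
}

pub fn submit<'stream, F>(stream: &'stream Stream, work: F) -> CudaResult<SubmitFuture<'stream, F>>
where
    F: FnOnce(&Executor<'stream>) -> CudaResult<()>,
{
    // A real version would decide here whether to defer `work` to the
    // first poll or run it eagerly - that's the open question above.
    Ok(SubmitFuture { stream, work: Some(work) })
}
```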

Now that I think about it, this would probably help solve the safety problems with Contexts as well.

rusch95 (Contributor) commented Dec 9, 2018

Ah, I think I understand what you're saying now, and I think that should work.

rusch95 (Contributor) commented Dec 9, 2018

Cool, it works. Link.

Will need to sprinkle in some unsafe black magic, so that the data can be copied back from the mutable buffer by future async_memcpy calls.

bheisler (Owner) commented Dec 9, 2018

Slight problem with that: Link.

Scheduling multiple copies using the same buffer is completely safe as long as they're all on the same stream, but this implementation disallows it.

rusch95 (Contributor) commented Dec 9, 2018

Yeah, that's what I was getting at with the second part of my comment.

My current solution is to return the references wrapped such that later async_copy calls can consume them, but they can't be dereferenced by anything else.

let (_, mid_ref) = executor.copy(start_ref, mid_ref);
executor.copy(mid_ref, end_ref);
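
One way that wrapper could be expressed (illustrative only; InFlight and this function aren't from the actual PR):

```rust
use rustacuda::error::CudaResult;

// Hypothetical wrapper: while a transfer may still be running, the buffer
// can only be threaded into further async copies on the same stream; it
// cannot be dereferenced or mutated through this handle.
pub struct InFlight<'a, T: ?Sized> {
    buf: &'a mut T,
}

// Hypothetical chained copy matching the snippet above: it consumes the
// handles and hands back fresh ones for the next copy in the chain.
pub fn async_copy<'a, T: ?Sized>(
    src: InFlight<'a, T>,
    dst: InFlight<'a, T>,
) -> CudaResult<(InFlight<'a, T>, InFlight<'a, T>)> {
    // A real version would enqueue cuMemcpyAsync on the stream here.
    Ok((src, dst))
}
```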

bheisler (Owner) commented Dec 9, 2018

I'd be very wary of unsafe black magic in this case - we could end up introducing undefined behavior while trying to hide other undefined behavior.

Anyway, this is kinda what I was thinking. If you can find an ergonomic way to make it unsafe to modify the buffers while they're used in an async copy, that's great. If not, I'd be OK with just doing this even if it is slightly vulnerable to data races.

rusch95 (Contributor) commented Dec 10, 2018

How is pinned host memory handled right now? Is that what the DeviceCopy trait indicates?

rusch95 (Contributor) commented Dec 10, 2018

Additionally, the unsafe wrapper layer is done now, except that the async_copy_device_to_host test doesn't actually synchronize for some reason and fails some of the time. I think the stack is pinned, so I don't think that's the cause.

After solving that issue, next up will be trying to wrap this all safely as futures, based on our earlier discussion.

bheisler (Owner) commented:

Page-locked memory is all handled by the driver. You call a certain CUDA API function to allocate and free page-locked memory. The driver tracks which memory ranges are locked and uses a fast path for copies to/from those ranges.
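
In RustaCUDA terms, that's what LockedBuffer wraps. A quick sketch of allocating one (assuming LockedBuffer::new takes an initial value and an element count; check the crate docs for the exact constructor):

```rust
use rustacuda::error::CudaResult;
use rustacuda::memory::LockedBuffer;

fn pinned_staging_buffer() -> CudaResult<LockedBuffer<u32>> {
    // Allocates page-locked host memory through the driver (cuMemAllocHost
    // under the hood); dropping the buffer frees it again. Copies to and
    // from this allocation take the driver's fast path.
    LockedBuffer::new(&0u32, 4096)
}
```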

bheisler (Owner) commented:

DeviceCopy is for structures that can safely be copied to the device (i.e., they don't manage host-side resources or contain pointers that are only valid on the host). It has nothing to do with page-locking, pinning, or anything else.
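
For example (just a sketch of the distinction; DeviceCopy is an unsafe marker trait, so implementing it by hand is a promise that the type really is plain data):

```rust
use rustacuda::memory::DeviceCopy;

// Fine to send to the device: plain data with no host-side resources.
#[derive(Clone, Copy)]
struct Vec3 {
    x: f32,
    y: f32,
    z: f32,
}
unsafe impl DeviceCopy for Vec3 {}

// NOT DeviceCopy: a String owns a host heap allocation, so the pointer
// inside it would be meaningless (and dangerous) on the device.
struct Label {
    text: String,
}
```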

rusch95 (Contributor) commented Dec 10, 2018

Alright, so AsyncMemcpy requires pinned memory, but it also errors properly at runtime if given memory that isn't page-locked, so we don't necessarily need to mark that in the wrapper.

rusch95 (Contributor) commented Dec 10, 2018

The error I was mentioning only appears when multiple tests are run at the same time.

EDIT: Nevermind, it appears rarely when run alone.

rusch95 (Contributor) commented Dec 11, 2018

Alright, to sum up my current thoughts on this:

  1. AsyncMemcpy requires locked memory, so we can async copy from and to devices using only DeviceFoo, UnifiedFoo, and LockedFoo.
  2. AsyncMemcpy cannot take T: DeviceCopy anymore. Thus, providing safety by forcing the arguments to be of a certain form, specifically one protected by a read/write lock, isn't too horrible a choice now, especially considering that most workloads will be large arrays. However, we still might be able to use scoping and the borrow checker to get rid of that overhead, in exchange for slightly more verbose syntax.
  3. For seeing when the AsyncMemcpys are done, I think Futures are still a nice thing to provide. Calling context.sync() or stream.sync() will probably be useful for proofs of concept and certain workflows, so I also want that as an option, though it might provide only runtime safety instead of compile-time safety.

bheisler (Owner) commented:

AsyncMemcpy cannot take T: DeviceCopy anymore.

Could you elaborate more on this? Why not?

rusch95 (Contributor) commented Dec 12, 2018

Previously, I thought AsyncCopyDestination might be able to take T: DeviceCopy references as values for copy_from and copy_to, just like CopyDestination. However, according to the documentation, not using page-locked memory will throw errors. I haven't gotten these errors, only silent failures, which is strange, so I might write some more test cases to chase down what's going on there.

khyperia commented:

Hey, I'm really interested in this feature! (I'm porting my hobby raytracer to rustacuda)

I'd be completely fine with really low-tech solutions to this problem, just to get the feature out there:

  1. Just make copy_{from,to}_async unsafe, and point out in documentation "hey, don't free this memory"
  2. Make copy_{from,to}_async take an Arc<LockedBuffer>. Then, inside, add the Arc to a Vec that gets cleared whenever the user calls synchronize, or when any (possibly user-provided) CUevent callback fires (in that case, only clear the ones added before the event was queued).

Something that I can't seem to find any documentation on is the behavior of the driver when a buffer is freed in the middle of work. The driver may already take care of the hard parts of this - cuMemFreeHost may not actually free the buffer until work is complete. In that case, we wouldn't have to worry about any of this.

(I'd be happy to write a PR for option 1, and if you like it, a PR for option 2 given a bit of time)
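
A very rough sketch of what option 2's bookkeeping could look like (hypothetical type, not an existing API):

```rust
use std::any::Any;
use std::sync::Arc;

use rustacuda::error::CudaResult;
use rustacuda::memory::{DeviceCopy, LockedBuffer};
use rustacuda::stream::Stream;

// Hypothetical: a stream wrapper that retains every host buffer handed to
// an async copy until the stream has been synchronized.
pub struct TrackedStream {
    stream: Stream,
    in_flight: Vec<Arc<dyn Any>>,
}

impl TrackedStream {
    /// Enqueue a copy and retain the host buffer; even if the caller drops
    /// their handle, the allocation outlives the transfer.
    pub fn enqueue_copy<T: DeviceCopy + 'static>(
        &mut self,
        host: Arc<LockedBuffer<T>>,
    ) -> CudaResult<()> {
        // A real version would issue the async copy on self.stream here.
        self.in_flight.push(host);
        Ok(())
    }

    /// Wait for all queued work, then release the retained buffers.
    pub fn synchronize(&mut self) -> CudaResult<()> {
        self.stream.synchronize()?;
        self.in_flight.clear();
        Ok(())
    }
}
```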

rusch95 (Contributor) commented Dec 23, 2018

Let me finish up 1. It's pretty much done, with a PR up right now; I just need to rebase it and clean it up a bit more, but I've been slow on that because of the holidays. I'll schedule some time to finish it up by tomorrow.

I'll defer to you on doing 2, since I'll be busy for a while. For the safe form, though, I think you probably want something more like Arc<RwLock<LockedBuffer>>, where the held locks are released after synchronization.

rusch95 (Contributor) commented Dec 23, 2018

See #20 for the PR I'm writing.

bheisler (Owner) commented:

Hey, I'm really interested in this feature! (I'm porting my hobby raytracer to rustacuda)

Thanks for your interest, and thanks for trying RustaCUDA! Yeah, I'd be interested in pull requests, though rusch95 has already submitted a WIP PR to add an unsafe interface for async memcpy. We may have to iterate a few times to find a good balance of safety, ergonomics and performance for the safe interface.

LutzCle (Contributor) commented Mar 4, 2019

Previously, I thought AsyncCopyDestination might be able to take T: DeviceCopy references as values for copy_from and copy_to, just like CopyDestination. However, according to the documentation, not using page-locked memory will throw errors. I haven't gotten these errors, only silent failures, which is strange, so I might write some more test cases to chase down what's going on there.

This is outdated information. The documentation you're referencing is for CUDA 2.3.

Modern CUDA versions are able to use any type of memory (both pageable and page-locked) in cuMemcpyAsync(). The documentation makes no comment on page-locked memory anymore. In fact, I've already used pageable memory in a project before.

Please do refer to, e.g., the CUDA 8.0 documentation or later. It would be unfortunate if RustaCUDA were to enforce such outdated limitations using the Rust type system.

rusch95 (Contributor) commented Mar 4, 2019 via email

rusch95 (Contributor) commented Oct 6, 2019

Digging this back up now that async is mostly stabilized to note that I'll try adding in a proper async API.
