Add support for asynchronous memcpy #8

Open · bheisler opened this issue Nov 25, 2018 · 32 comments

Labels: Bigger Project (A larger project, typically involving designing a safe, Rust-y wrapper around a CUDA API concept) · New CUDA Feature (Expose a new CUDA feature through RustaCUDA)

bheisler (Owner) commented Nov 25, 2018

Copying memory asynchronously allows the memcpy to overlap with other work, as long as that work doesn't depend on the copied data. This is important for optimal performance, so RustaCUDA should provide access to it.

bheisler added the New CUDA Feature and Bigger Project labels on Nov 25, 2018
rusch95 (Contributor) commented Dec 6, 2018

I'll take a stab at this.

rusch95 (Contributor) commented Dec 7, 2018

My current plan is to add a trait AsyncCopyDestination with async_copy_from and async_copy_to methods. My other thought was to implement an async version of CopyDestination itself, but that would preclude doing both sync and async copying.
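
Roughly something like this sketch (the exact signatures, and whether these methods need to be unsafe, are exactly what's still open):

```rust
use rustacuda::error::CudaResult;
use rustacuda::stream::Stream;

// Sketch of the proposed trait, mirroring CopyDestination but taking a
// stream and returning before the transfer has necessarily completed.
pub trait AsyncCopyDestination<O: ?Sized> {
    /// Enqueue an async copy from `source` into `self` on `stream`.
    /// The caller must keep both buffers alive and untouched until the
    /// stream has been synchronized.
    unsafe fn async_copy_from(&mut self, source: &O, stream: &Stream) -> CudaResult<()>;

    /// Enqueue an async copy from `self` into `dest` on `stream`.
    unsafe fn async_copy_to(&self, dest: &mut O, stream: &Stream) -> CudaResult<()>;
}
```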

@bheisler Thoughts?

rusch95 (Contributor) commented Dec 7, 2018

Additionally, device.rs is getting rather large (~1200 lines), so I'd like your thoughts on splitting it into sync.rs for synchronous memory transfer functions and tests, async.rs for asynchronous functions and tests, and then device.rs for the rest.

rusch95 (Contributor) commented Dec 7, 2018

Actually, spinning off DeviceBox, DeviceSlice, DeviceChunks, and DeviceBuffer into their own files, if possible, would probably be cleaner.

bheisler (Owner) commented Dec 7, 2018

I would split device.rs into three files for DeviceBox, DeviceSlice/DeviceChunks and DeviceBuffer.

As for CopyDestination - I'm still not entirely sold on the current design for that trait. Having both copy_from and copy_to on the same trait seems unnecessary. Anyway, fixing that (if I do fix that) would be another issue. Yeah, an AsyncCopyDestination seems like a good way to go.

rusch95 (Contributor) commented Dec 7, 2018

Hmmm, I forgot how tricky async safety is. To make sure the arguments stay valid, maybe returning a promise bound to the lifetime of the passed references is the way to go?
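
Something along these lines, purely as an illustration of the idea (names hypothetical; as the next comment points out, borrowing alone turns out not to be enough, because the promise can be leaked):

```rust
use std::marker::PhantomData;

use rustacuda::error::CudaResult;
use rustacuda::stream::Stream;

// Hypothetical: the copy hands back a promise that borrows the buffers,
// so the borrow checker keeps them alive and un-mutated until `wait`.
pub struct CopyPromise<'buf> {
    stream: &'buf Stream,
    _borrows: PhantomData<&'buf mut ()>,
}

impl<'buf> CopyPromise<'buf> {
    /// Block until the enqueued copy has finished, releasing the borrows.
    pub fn wait(self) -> CudaResult<()> {
        self.stream.synchronize()
    }
}
```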

bheisler (Owner) commented Dec 7, 2018

Yeah, this will be tricky alright. I haven't planned out a design for this. The only time we can be sure that it's safe to drop either the host-side or the device-side half of the transfer is after a synchronize (and it has to synchronize on the same stream as well).

I've been thinking about using the Futures API to handle asynchronous stuff safely (though I'm still fuzzy on the details), so it might be necessary to hold off on this until we figure that out some more.

rusch95 (Contributor) commented Dec 8, 2018

My current thought is something similar to this code.

This would also require bookkeeping in the buffers themselves so that they panic if the promise is dropped and the buffers are then used.

Alternatively, we could wait longer for async/await, the futures book, and all the other async goodies, and then go for the implementation, but I think that would require the same panic bookkeeping.

AndrewGaspar (Contributor) commented Dec 8, 2018

Unfortunately you can't do this. Forgetting a value is safe in Rust. Therefore you could forget the promise while the buffers are still borrowed:
https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=26a7e7d4bb0a348ca05cc210e7c3962a

In rsmpi we solve this using a scope concept. See the example in the README. You can also look at the documentation. Essentially, a scope object has a shorter lifetime than all buffers used by the scope, and only a reference to the scope is given to user code, meaning it can't be forgotten. New Requests (essentially promises in MPI) are attached to a scope. When the scope lambda ends, it panics if any Requests attached to the scope were not completed.

We also currently have an outstanding PR (that I still need to finish 😊) that allows you to attach a buffer wholesale to a request (i.e. Vec<T>, Box<[T]>, etc.). That's another thing you could allow.
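
Transplanted to CUDA streams, the scope idea might look roughly like this (hypothetical names, loosely modeled on rsmpi's scope rather than anything that exists in RustaCUDA today):

```rust
use rustacuda::error::CudaResult;
use rustacuda::stream::Stream;

// User code only ever receives `&CopyScope`, so the scope itself cannot be
// leaked with mem::forget, and every transfer enqueued through it borrows
// its buffers for at least `'scope`.
pub struct CopyScope<'scope> {
    stream: &'scope Stream,
}

// Run `f` with a scope tied to `stream`, and synchronize before returning,
// so no buffer borrowed by the scope can be freed or reused while a copy
// might still be in flight.
pub fn scoped<'scope, R>(
    stream: &'scope Stream,
    f: impl FnOnce(&CopyScope<'scope>) -> CudaResult<R>,
) -> CudaResult<R> {
    let scope = CopyScope { stream };
    let result = f(&scope);
    // Synchronize even if the closure bailed out early.
    stream.synchronize()?;
    result
}
```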

rusch95 (Contributor) commented Dec 8, 2018

Yeah, I didn't really explain my ideas for bookkeeping around what happens if the promise is dropped. My bad on that.

Anyways, this scope approach looks very promising!

bheisler (Owner) commented Dec 8, 2018

Yeah, that fits really well with how I was planning to handle futures.

See, it's not zero-cost to create a Future tied to a CUDA stream - you have to add a CUevent to the stream that can be polled and probably queue up a callback to wake up the future. I don't think most users will want to poll on every asynchronous task, so I figured I'd do something like this:

let stream = ...;
// Need better names
let future = stream.submit(|executor| {
    // Submit a bunch of async work using the executor
    Ok(())
})?;

Then the submit function would return a Future that contains the closure (to keep the references alive). On the first poll (or maybe immediately? Not sure which is best) it would execute the closure to submit the work, submit the event and callback, and prepare to poll the event until the work is complete.

If we add non-futures-based async functions, that can just be a different submit function that synchronize()'s to block after calling the closure.
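
A signature-level sketch of that shape (everything here, including the names, is hypothetical):

```rust
use rustacuda::error::CudaResult;
use rustacuda::stream::Stream;

// Hypothetical executor handed to the closure; async copies and kernel
// launches would be enqueued through it onto the underlying stream.
pub struct Executor<'stream> {
    stream: &'stream Stream,
}

// Hypothetical future returned by `submit`: it owns the closure (keeping
// its captured references alive). On first poll it would run the closure,
// record a CUevent on the stream, and register a callback that wakes the
// task once the event completes.
pub struct SubmitFuture<'stream, F> {
    stream: &'stream Stream,
    work: Option<F>,
}

pub fn submit<'stream, F>(stream: &'stream Stream, work: F) -> CudaResult<SubmitFuture<'stream, F>>
where
    F: FnOnce(&Executor<'stream>) -> CudaResult<()>,
{
    // A real version would decide here whether to defer `work` to the
    // first poll or run it eagerly - that's the open question above.
    Ok(SubmitFuture { stream, work: Some(work) })
}
```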

Now that I think about it, this would probably help solve the safety problems with Contexts as well.

rusch95 (Contributor) commented Dec 9, 2018

Ah, I think I understand what you're saying now, and I think that should work.

rusch95 (Contributor) commented Dec 9, 2018

Cool, it works. Link.

Will need to sprinkle in some unsafe black magic, so that the data can be copied back from the mutable buffer by future async_memcpy calls.

bheisler (Owner) commented Dec 9, 2018

Slight problem with that: Link.

Scheduling multiple copies using the same buffer is completely safe as long as they're all on the same stream, but this implementation disallows it.

rusch95 (Contributor) commented Dec 9, 2018

Yeah, that's what I was getting at with the second part of my comment.

My current solution is to return the references wrapped such that later async_copy calls can consume them, but they can't be dereferenced by anything else.

let (_, mid_ref) = executor.copy(start_ref, mid_ref);
executor.copy(mid_ref, end_ref);
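
One way that wrapper could be expressed (illustrative only; InFlight and this function aren't from the actual PR):

```rust
use rustacuda::error::CudaResult;

// Hypothetical wrapper: while a transfer may still be running, the buffer
// can only be threaded into further async copies on the same stream; it
// cannot be dereferenced or mutated through this handle.
pub struct InFlight<'a, T: ?Sized> {
    buf: &'a mut T,
}

// Hypothetical chained copy matching the snippet above: it consumes the
// handles and hands back fresh ones for the next copy in the chain.
pub fn async_copy<'a, T: ?Sized>(
    src: InFlight<'a, T>,
    dst: InFlight<'a, T>,
) -> CudaResult<(InFlight<'a, T>, InFlight<'a, T>)> {
    // A real version would enqueue cuMemcpyAsync on the stream here.
    Ok((src, dst))
}
```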

bheisler (Owner) commented Dec 9, 2018

I'd be very wary of unsafe black magic in this case - we could end up introducing undefined behavior while trying to hide other undefined behavior.

Anyway, this is kinda what I was thinking. If you can find an ergonomic way to make it unsafe to modify the buffers while they're used in an async copy, that's great. If not, I'd be OK with just doing this even if it is slightly vulnerable to data races.

rusch95 (Contributor) commented Dec 10, 2018

How is pinned host memory handled right now? Is that what the DeviceCopy trait indicates?

rusch95 (Contributor) commented Dec 10, 2018

Additionally, the unsafe wrapper layer is done now, except that the async_copy_device_to_host test doesn't actually synchronize for some reason and fails some of the time. I think the stack is pinned, so I don't think that's the cause.

After solving that issue, next up will be trying to wrap this all safely as futures, based on our earlier discussion.

bheisler (Owner) commented:

Page-locked memory is all handled by the driver. You call a certain CUDA API function to allocate and free page-locked memory. The driver tracks which memory ranges are locked and uses a fast path for copies to/from those ranges.
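
In RustaCUDA terms, that's what LockedBuffer wraps. A quick sketch of allocating one (assuming LockedBuffer::new takes an initial value and an element count; check the crate docs for the exact constructor):

```rust
use rustacuda::error::CudaResult;
use rustacuda::memory::LockedBuffer;

fn pinned_staging_buffer() -> CudaResult<LockedBuffer<u32>> {
    // Allocates page-locked host memory through the driver (cuMemAllocHost
    // under the hood); dropping the buffer frees it again. Copies to and
    // from this allocation take the driver's fast path.
    LockedBuffer::new(&0u32, 4096)
}
```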

bheisler (Owner) commented:

DeviceCopy is for structures that can safely be copied to the device (i.e., they don't manage host-side resources or contain pointers that are only valid on the host). It has nothing to do with page-locking, pinning, or anything else.
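
For example (just a sketch of the distinction; DeviceCopy is an unsafe marker trait, so implementing it by hand is a promise that the type really is plain data):

```rust
use rustacuda::memory::DeviceCopy;

// Fine to send to the device: plain data with no host-side resources.
#[derive(Clone, Copy)]
struct Vec3 {
    x: f32,
    y: f32,
    z: f32,
}
unsafe impl DeviceCopy for Vec3 {}

// NOT DeviceCopy: a String owns a host heap allocation, so the pointer
// inside it would be meaningless (and dangerous) on the device.
struct Label {
    text: String,
}
```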

rusch95 (Contributor) commented Dec 10, 2018

Alright, so AsyncMemcpy requires pinned memory, but it also errors properly at runtime if given memory that isn't page-locked, so we don't necessarily need to mark that in the wrapper.

rusch95 (Contributor) commented Dec 10, 2018

The error I was mentioning only appears when multiple tests are run at the same time.

EDIT: Nevermind, it appears rarely when run alone.

rusch95 (Contributor) commented Dec 11, 2018

Alright, to sum up my current thoughts on this:

  1. AsyncMemcpy requires locked memory, so we can async copy from and to devices using only DeviceFoo, UnifiedFoo, and LockedFoo.
  2. AsyncMemcpy cannot take T: DeviceCopy anymore. Thus, providing safety by forcing the arguments to be of a certain form, specifically one protected by a read/write lock, isn't too horrible a choice now, especially considering that most workloads will be large arrays. However, we still might be able to use scoping and the borrow checker to get rid of that overhead, in exchange for slightly more verbose syntax.
  3. For seeing when the AsyncMemcpys are done, I think Futures are still a nice thing to provide. Calling context.sync() or stream.sync() will probably be useful for proofs of concept and certain workflows, so I also want that as an option, though it might provide only runtime safety instead of compile-time safety.

bheisler (Owner) commented:

AsyncMemcpy cannot take T: DeviceCopy anymore.

Could you elaborate more on this? Why not?

rusch95 (Contributor) commented Dec 12, 2018

Previously, I thought AsyncCopyDestination might be able to take T: DeviceCopy references as values for copy_from and copy_to, just like CopyDestination. However, according to the documentation, not using page-locked memory will throw errors. I haven't gotten these errors, only silent failures, which is strange, so I might write some more test cases to chase down what's going on there.

khyperia commented:

Hey, I'm really interested in this feature! (I'm porting my hobby raytracer to rustacuda)

I'd be completely fine with really low-tech solutions to this problem, just to get the feature out there:

  1. Just make copy_{from,to}_async unsafe, and point out in documentation "hey, don't free this memory"
  2. Make copy_{from,to}_async take an Arc<LockedBuffer>. Then, inside, add the Arc to a Vec that gets cleared whenever the user calls synchronize, or when any (possibly user-provided) CUevent callback fires (in that case, only clear the ones added before the event was queued).

Something that I can't seem to find any documentation on is the behavior of the driver when a buffer is freed in the middle of work. The driver may already take care of the hard parts of this - cuMemFreeHost may not actually free the buffer until work is complete. In that case, we wouldn't have to worry about any of this.

(I'd be happy to write a PR for option 1, and if you like it, a PR for option 2 given a bit of time)
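
A very rough sketch of what option 2's bookkeeping could look like (hypothetical type, not an existing API):

```rust
use std::any::Any;
use std::sync::Arc;

use rustacuda::error::CudaResult;
use rustacuda::memory::{DeviceCopy, LockedBuffer};
use rustacuda::stream::Stream;

// Hypothetical: a stream wrapper that retains every host buffer handed to
// an async copy until the stream has been synchronized.
pub struct TrackedStream {
    stream: Stream,
    in_flight: Vec<Arc<dyn Any>>,
}

impl TrackedStream {
    /// Enqueue a copy and retain the host buffer; even if the caller drops
    /// their handle, the allocation outlives the transfer.
    pub fn enqueue_copy<T: DeviceCopy + 'static>(
        &mut self,
        host: Arc<LockedBuffer<T>>,
    ) -> CudaResult<()> {
        // A real version would issue the async copy on self.stream here.
        self.in_flight.push(host);
        Ok(())
    }

    /// Wait for all queued work, then release the retained buffers.
    pub fn synchronize(&mut self) -> CudaResult<()> {
        self.stream.synchronize()?;
        self.in_flight.clear();
        Ok(())
    }
}
```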

rusch95 (Contributor) commented Dec 23, 2018

Let me finish up 1. It's pretty much done, with a PR up right now; I just need to rebase it and clean it up a bit more, but I've been slow on that because of the holidays. I'll schedule some time to finish it up by tomorrow.

I'll defer to you on doing 2, since I'll be busy for a while. For the safe form, though, I think you probably want something more like Arc<RwLock<LockedBuffer>>, where the held locks are released after synchronization.

rusch95 (Contributor) commented Dec 23, 2018

See #20 for the PR I'm writing.

bheisler (Owner) commented:

Hey, I'm really interested in this feature! (I'm porting my hobby raytracer to rustacuda)

Thanks for your interest, and thanks for trying RustaCUDA! Yeah, I'd be interested in pull requests, though rusch95 has already submitted a WIP PR to add an unsafe interface for async memcpy. We may have to iterate a few times to find a good balance of safety, ergonomics and performance for the safe interface.

LutzCle (Contributor) commented Mar 4, 2019

Previously, I thought AsyncCopyDestination might be able to take T: DeviceCopy references as values for copy_from and copy_to, just like CopyDestination. However, according to the documentation, not using page-locked memory will throw errors. I haven't gotten these errors, only silent failures, which is strange, so I might write some more test cases to chase down what's going on there.

This is outdated information. The documentation you're referencing is for CUDA 2.3.

Modern CUDA versions are able to use any type of memory (both pageable and page-locked) in cuMemcpyAsync(). The documentation makes no comment on page-locked memory anymore. In fact, I've already used pageable memory in a project before.

Please do refer to, e.g., the CUDA 8.0 documentation or later. It would be unfortunate if RustaCUDA were to enforce such outdated limitations using the Rust type system.

rusch95 (Contributor) commented Mar 4, 2019 via email

rusch95 (Contributor) commented Oct 6, 2019

Digging this back up now that async is mostly stabilized to note that I'll try adding in a proper async API.
