
[abandoned] Allowing zerocopy: makeAvailable(queue, dst, srcView) #1820

Draft · wants to merge 4 commits into base: develop

Conversation

@bernhardmgruber (Member) commented Oct 29, 2022

Adds an overload set zero_memcpy that only copies a buffer if the destination device requires the buffer in a different memory space in order to access it. Otherwise, no copy is performed and just the handles are adjusted.

Motivating use case (for me):
Avoid duplicating data if the accelerator device is the same as the host device. This is the case in the standard workflow where we create a buffer on the host, then a buffer on the accelerator device, and then copy from host -> device. If e.g. the TBB or OpenMP backend is active, there is no need for two buffers and an actual memory copy.

This zero-copy idea can be extended further to also cover other technologies like CUDA's USM (managed memory), remote memory access, etc. For now, I do not have the time to work out all those details, so I am just proposing a minimal API to accomplish my motivating use case.

Example randomCells2D is rewritten to use the new API.
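For illustration, a minimal sketch of the intended usage next to the standard workflow (a sketch only; devHost, devAcc, queue, Idx and extents are assumed to be set up as in a typical alpaka example, not taken from this PR):

// standard workflow: always allocates a second buffer and always copies
auto bufHost = alpaka::allocBuf<float, Idx>(devHost, extents);
auto bufAcc = alpaka::allocBuf<float, Idx>(devAcc, extents);
alpaka::memcpy(queue, bufAcc, bufHost);

// with this PR: the allocation and the copy only happen if the device cannot access the host buffer
auto bufHost2 = alpaka::allocBuf<float, Idx>(devHost, extents);
auto bufAcc2 = alpaka::zero_memcpy(queue, bufHost2); // may simply alias bufHost2, e.g. on the TBB or OpenMP backend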

Addresses part of: #28

TODO:

  • Unit tests
  • Cheatsheet
  • Documentation?
  • Should this be an experimental feature? (namespace alpaka::experimental)

Comment on lines 221 to 228
BufAcc bufAccS{alpaka::zero_memcpy(queue, bufHostS)};
float* const ptrBufAccS{alpaka::getPtrNative(bufAccS)};
auto pitchBufAccS = alpaka::getPitchBytes<1u>(bufAccS);
alpaka::memcpy(queue, bufAccS, bufHostS);
Member Author:

Buffer creation is moved down to where the memory copies were performed, because allocation/memcpy is now one potential step hidden behind zero_memcpy.

Comment on lines 207 to 211
//! Makes the content of the source view available on the device associated with the destination queue. If the
//! destination shares the same memory space as the source view, no copy is performed and the destination view is
//! updated to share the same buffer as the source view. Otherwise, a memcpy is performed.
template<typename TQueue, typename TViewDst, typename TViewSrc>
ALPAKA_FN_HOST void zero_memcpy(TQueue& dstQueue, TViewDst& viewDstOpt, TViewSrc const& viewSrc)
Member Author:

This version is used when the user already has a destination buffer. Typically, this is the case when transferring back from the device to the host buffer.
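As a sketch of that use (bufHost, bufAcc and queue are assumed from the surrounding example; not code from this PR):

// copy back only if bufAcc actually lives in a different memory space than bufHost
alpaka::zero_memcpy(queue, bufHost, bufAcc);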

Member:

From our offline discussion yesterday: IMO TQueue& dstQueue must be the source queue, because you should not return control to the caller before the memory operations on the source device are finished.

Contributor:

I'm thinking of two possible use cases:

  • "transfer" data from host to device
  • "transfer" data from device to host

where device could be any device for which we do not want to make the copy, but take the zero-copy path; it could be a CPU backend, or a GPU backend with unified memory, etc.

In both cases I would expect to use a queue associated with the device, not with the host, same as for the regular memcpy function.

So, maybe this should simply be TQueue queue?

@psychocoderHPC (Member), Jan 25, 2023:

@fwyzard I think it must always be the source queue, because the zero-copy operation finishes instantly, so it takes no time, but it is not allowed to use viewDstOpt before all operations on viewSrc are finished. To guard this case you need to wait for the source queue. If we were to introduce a zero_memcpy_async, we would need to return an event that tells us when viewDstOpt can be used.
Even for host to device we need the source queue (the host queue) because the user can run a host-accelerated kernel with viewSrc.
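A purely hypothetical shape of such an asynchronous variant, only to illustrate the point about the event (not proposed in this PR):

// hypothetical: the returned event signals when viewDstOpt may be used
template<typename TQueue, typename TViewDst, typename TViewSrc>
auto zero_memcpy_async(TQueue& srcQueue, TViewDst& viewDstOpt, TViewSrc const& viewSrc) -> alpaka::Event<TQueue>;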

Member:

Hmm, I am a little bit confused: the function is called zero_memcpy, but it performs a copy in case the memory is not available on the destination device. So it is NOT a zero memcpy, it is some kind of optimized (intelligent) copy.

In that case I think we need the source and the destination queue, to guard when the operation starts in case it is not zero-copyable, and to ensure that follow-up operations on the destination do not start before the possible copy is finished.

IMO the function should not be called zero_memcpy, or I have another definition of zero memcpy in mind.

Member Author:

In both cases I would expect to use a queue associated with the device, not with the host, same as for the regular memcpy function.
So, maybe this should simply be TQueue queue?

Exactly, so I renamed it yesterday like so, but GitHub is only showing Outdated instead of the actual state of the PR.

it is not allowed to use viewDstOpt before all operations on viewSrc are finished. To guard this case you need to wait for the source queue.

So how do we guard against this in the standard alpaka::memcpy(queue, dstView, srcView)? Isn't it the same problem there? So I wondered what the meaning of this queue actually is, and whether it needs to have any relation to the views. Looking into the implementation, there is actually no requirement on the queue related to the views. For CUDA, the associated devices of the views determine whether the memcpy kind is host-2-device, device-2-device, etc., but I could apparently use a queue on a device A and then memcpy between two views on different devices B and C. This may be a design bug, but it leads me to believe that the queue's meaning is to allow the user to sequence operations with respect to other operations. And if the user gets that wrong, then that's the user's fault.

Even for host to device we need the source queue (host queue) because the user can run a host-accelerated kernel with viewSrc.

Could you point me to an example where we actually have a queue to the host device as well (in addition to the accelerator device) and how that would be used? This is the first time I hear of this use.

Hmm, I am a little bit confused: the function is called zero_memcpy, but it performs a copy in case the memory is not available on the destination device.

Right, which is why I chose the provocative placeholder name zero_memcpy. We still need to have that discussion on how to name this functionality.

I was tempted yesterday to define zero-copy as: zero_memcpy(queue, dev, srcView) -> optional<DstView> or even omit the queue parameter altogether, because there is no real work necessary in case the view is also accessible on the destination device. However, this is not the API I would want to have as a user and I would immediately need to implement:

if (auto dst = zero_memcpy(queue, dev, srcView))
    return *dst;
else {
    auto dst = allocBuf(...);
    memcpy(queue, dst, srcView);
    return dst;
}

I think @j-stephan suggested in the past to use make_available and @fwyzard suggested to use convey or smart_memcpy in #28.

In that case I think we need the source and the destination queue, to guard when the operation starts in case it is not zero-copyable, and to ensure that follow-up operations on the destination do not start before the possible copy is finished.

Again, it is the user's responsibility to use the queues correctly. If you create two queues and use two views on both of them, you can also create a race with just the normal alpaka::memcpy.

Member:

Could you point me to an example where we actually have a queue to the host device as well (in addition to the accelerator device) and how that would be used? This is the first time I hear of this use.

I would say we do not have such an example, because we do not use accelerators co-located in one application.
As far as I know, we are very "dirty" in all our examples and run host code without using an accelerator, because we run the host code sequentially.
Even in PIConGPU we use native OpenMP, where normally an accelerator would be the better way to run parallel host code and to re-use all the device functions we implemented. The problem with PIConGPU is that we use cupla, where we cannot use two accelerators at the same time.
To be fair, this is one of the reasons why we made alpaka, but we never created an example :-(
Our tests use multiple accelerators in one test app, but not cooperatively, as far as I know :-(

Member Author:

As far as I know, we are very "dirty" in all our examples and run host code without using an accelerator, because we run the host code sequentially.

I don't think this is dirty. Alpaka should just coexist with ordinary C++. If I use a std::for_each(std::execution::par, ...) or a std::jthread next to an alpaka kernel call, this should be allowed (given that I also get all my synchronization right).

Our tests are using multiple accelerators in one test-app but not cooperative as I know :-(

Yeah, I think we should have a test/example for that, see #1909.

Member:

Yes, it should co-exist with "normal" C++. Nevertheless, all our examples assume we do not need parallelism on the host side, and we simply use a sequential for loop, e.g. to evaluate results, instead of a kernel that could do it in parallel.

Comment on lines 231 to 235
//! Makes the content of the source view available on the device associated with the destination queue. If the
//! destination shares the same memory space as the source view, no copy is performed and the source view is
//! returned. Otherwise, a new buffer is allocated on the destination device, a memcpy is performed, and the new buffer is returned.
template<typename TQueue, typename TViewSrc>
ALPAKA_FN_HOST auto zero_memcpy(TQueue& dstQueue, TViewSrc const& viewSrc)
Member Author:

This version is used when the user does not have a destination buffer yet. Typically, this is the case when transferring from the host to a (potentially newly allocated) device buffer.

Member Author:

And I had to add another overload so the user could specify the destination device, because this is not necessarily the same as the queue device.
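To illustrate the resulting overload set, a hedged sketch of the call shapes (variable names are assumptions, not the exact PR code):

// destination device derived from the queue
auto bufAcc = alpaka::zero_memcpy(queue, bufHost);
// destination device given explicitly, because it may differ from the queue's device
auto bufHost2 = alpaka::zero_memcpy(queue, devHost, bufAcc);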

return viewSrc;
}

auto dst = allocAsyncBufIfSupported<Elem<TViewSrc>, Idx<TViewSrc>>(dstQueue, getExtentVec(viewSrc));
Member Author:

What strategy should I use to allocate a new buffer here?

@bernhardmgruber (Member Author):

For now I implemented the new routines for buffers. I wondered whether it makes any sense to allow more generic views as well. I guess not, because views have no device associated with them?

namespace detail
{
template<typename DevDst, typename DevSrc>
auto canZeroCopy(const DevDst& devDst, const DevSrc& devSrc) -> bool
Contributor:

Can this be constexpr?

Member Author:

Maybe. I chose to be conservative for now in case we ever need to call a runtime function to figure out whether we can zero-copy between two devices.
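Purely to illustrate that trade-off, a hypothetical sketch (not the PR's actual implementation): a compile-time-only check could be constexpr, whereas keeping it a regular function leaves room for a runtime capability query:

#include <type_traits>

// hypothetical sketch: zero-copy only between identical devices of the same backend
template<typename DevDst, typename DevSrc>
auto canZeroCopy(DevDst const& devDst, DevSrc const& devSrc) -> bool
{
    if constexpr(std::is_same_v<DevDst, DevSrc>)
        return devDst == devSrc; // a runtime query (e.g. peer access) could be added here
    else
        return false;
}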

Contributor:

Understood.

By the way, what should be the behaviour of zero_memcpy(queue, bufOnSomeGpu, bufOnOtherGPu)?

Make a copy, or let SomeGpu access the buffer from OtherGpu's memory through NVLink/PCIe/etc.?

Member Author:

That is a very good question and I don't know what's best. I added it to #28.

Comment:

I would prefer the second version, as it makes use of (fast) hardware capabilities. Now the question to me is: can and should we decide automatically whether this is possible, or let the users declare what they want?

Member Author:

So, I renamed the feature to makeAvailable now, which is what its intent is. So makeAvailable(queue, bufOnSomeGpu, bufOnOtherGPu) should make the memory available on the other GPU. If zero-copying is possible by any means (mapping memory through PCIe, NVLink, etc.), it should be done. If none of that is possible, a full copy is performed.

If the user intends to have a full copy anyway, they should just say so and write memcpy(queue, bufOnSomeGpu, bufOnOtherGPu). In that sense, we should decide this automatically. But that decision may depend on the user having declared the involved buffers in a certain way.

If the user intends to use a buffer over NVLink between 2 CUDA GPUs (a very explicit use), then they don't need to use alpaka and can just call the CUDA API directly.
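For reference, the direct CUDA route alluded to here would be along these lines (illustrative only, outside of alpaka):

#include <cuda_runtime.h>

// enable peer access so device 0 can dereference device-1 pointers over NVLink/PCIe
cudaSetDevice(0);
cudaDeviceEnablePeerAccess(1 /* peer device */, 0 /* flags, must be 0 */);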

@psychocoderHPC (Member):

After discussion #1820 (comment) I think this method can be useful, but it is very dangerous for the user to use. There are two cases:

  • the buffer is visible on the other device, e.g. via unified memory or managed memory. In this case an operation on the destination memory will also change the values in the source
  • in case a deep copy of the source buffer is created, changes must be explicitly copied back to the source after the destination is changed.

As mentioned in #1820 (comment) I think this behavior must be reflected in the name of the "copy" function.

@psychocoderHPC (Member) commented Jan 25, 2023

I think the function canZeroCopy should stay, because this function tells you whether you need to copy the data or whether it is directly visible. But instead of the name zero_memcpy I suggest moveBuffer() (camel case, because I think all names in alpaka are camel case at the moment). The only problem I see is that move somehow implies that the source buffer should not be used anymore.

Maybe the function should be called makeDataAvailable()?

@bernhardmgruber changed the title from Zerocopy to Allowing zerocopy: makeAvailable(queue, dst, srcView) on Jan 25, 2023
Adds an overload set `makeAvailable` that only copies a buffer if the destination device requires the buffer in a different memory space. Otherwise, no copy is performed and just the handles are adjusted.

Fixes: alpaka-group#28
typename TViewSrc,
std::enable_if_t<isDevice<TDevDst>, int> = 0,
typename TViewDst = Buf<TDevDst, Elem<TViewSrc>, Dim<TViewSrc>, Idx<TViewSrc>>>
ALPAKA_FN_HOST auto makeAvailable(TQueue& queue, TDevDst const& dstDev, TViewSrc const& viewSrc) -> TViewDst
Member:

Could you please extend the documentation and point out that the user must take care that all operations on viewSrc are finished?
This was one point where I was always confused in this PR, but now I understand that the user must link the queues to avoid data races:

alpaka::enqueue(queueWhereViewSrcIsUsed, event);
wait(destQueue, event);
auto newBuff = alpaka::makeAvailable(destQueue, destDev, viewSrc);

Member:

@bernhardmgruber dstDev is available from the queue parameter; is this extra parameter really required?

Member Author:

Could you please extend the documentation and point out that the user must take care that all operations on viewSrc are finished?

If you really insist, I can do it. It is also not documented for alpaka::memcpy, btw., and I think it is kind of obvious. It's also not true: you can have pending operations on the source view:

auto bufHost = alpaka::allocBuf<int, Idx>(devHost, extents);
auto bufAcc = alpaka::makeAvailable(queue, devAcc, bufHost);
alpaka::exec<Acc>(queue, workdiv, Kernel{}, alpaka::getPtrNative(bufAcc));
alpaka::makeAvailable(queue, bufHost, bufAcc); // pending kernel execution on bufAcc

Member Author:

dstDev is available from the queue parameter; is this extra parameter really required?

Yes, because the device of the queue is not always the destination device. See the example above, where the destination is the host, but the queue's device is the accelerator.

@bernhardmgruber (Member Author):

After discussion #1820 (comment) I think this method can be useful, but it is very dangerous for the user to use. There are two cases:

  • the buffer is visible on the other device, e.g. via unified memory or managed memory. In this case an operation on the destination memory will also change the values in the source
  • in case a deep copy of the source buffer is created, changes must be explicitly copied back to the source after the destination is changed.

I don't think this function is dangerous. It's just that we don't guarantee what the state of the source buffer is after makeAvailable. It can be disconnected from the destination or not, and you don't know. In fact, I think you don't need to know. But we could provide that extra information via an additional return value if necessary.
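A hypothetical shape for such an extra return value, purely to illustrate the idea (not part of this PR):

// hypothetical: bundle the destination buffer with the information whether it aliases the source
template<typename TBuf>
struct MakeAvailableResult
{
    TBuf buf;          // buffer usable on the destination device
    bool sharesMemory; // true if buf aliases the source view (the zero-copy path was taken)
};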

I think this behavior must be reflected in the name of the "copy" function.

Maybe the function should be called makeDataAvailable()?

[...] instead of the name zero_memcpy I suggest moveBuffer()

Are you fine with the proposed makeAvailable?

I think the function canZeroCopy should stay, because this function tells you whether you need to copy the data or whether it is directly visible

This is an internal function, and I think it's bad, because you actually need a mix of compile-time and runtime information to get the data types of the buffers right, and canZeroCopy does not handle that at the moment. See also #28 (comment).

The only problem I see is that move somehow implies that the source buffer should not be used anymore.

Which is exactly what I want.

@bernhardmgruber marked this pull request as draft on July 31, 2023, 14:35
@bernhardmgruber changed the title from Allowing zerocopy: makeAvailable(queue, dst, srcView) to [abandoned] Allowing zerocopy: makeAvailable(queue, dst, srcView) on Mar 2, 2024
@bernhardmgruber (Member Author) commented Mar 2, 2024

I am abandoning work on this. Feel free to use or discard anything from the changeset provided here.
