[BUG]: Suboptimal swap performance on universal vectors #2948

Open
1 task done
gevtushenko opened this issue Nov 25, 2024 · 3 comments · May be fixed by #2985

@gevtushenko
Collaborator

Is this a duplicate?

Type of Bug

Performance

Component

Thrust

Describe the bug

Swapping universal vectors is unnecessarily slow and occupies more memory than it should.
Device and host vectors have swap overloads in the thrust:: namespace.
These overloads call the efficient .swap member function:

template <typename T, typename Alloc>
void swap(device_vector<T, Alloc>& a, device_vector<T, Alloc>& b)
{
a.swap(b);
}

Universal and universal host pinned vectors are missing such an overload. The vector_base they alias to has its swap overload in thrust::detail, so a qualified thrust::swap call cannot reach it.
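The lookup problem described above can be reproduced without CUDA. The following is a minimal sketch in plain C++, assuming a made-up namespace lib that mirrors the layout described in this issue; lib, its vector_base, universal_vector, the counters, and the two helper functions are hypothetical stand-ins, not Thrust's actual code.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a library namespace; none of this is Thrust code.
namespace lib {

int generic_swaps = 0; // calls to the generic, copy-based swap
int fast_swaps = 0;    // calls to the O(1) overload

namespace detail {
// Stand-in for a base vector type that owns a buffer.
template <typename T>
struct vector_base {
    std::vector<T> storage;
    vector_base(std::size_t n, T v) : storage(n, v) {}
    void swap(vector_base& other) noexcept { storage.swap(other.storage); }
};

// Efficient overload, but declared in the nested namespace.
template <typename T>
void swap(vector_base<T>& a, vector_base<T>& b) {
    ++fast_swaps;
    a.swap(b);
}
} // namespace detail

// Only an alias: lib itself has no swap overload for this type.
template <typename T>
using universal_vector = detail::vector_base<T>;

// Generic swap algorithm that copies whole objects.
template <typename T>
void swap(T& a, T& b) {
    ++generic_swaps;
    T tmp = a;
    a = b;
    b = tmp;
}

} // namespace lib

// Qualified call: lookup is restricted to lib, so only the generic swap is found.
void qualified_swap(lib::universal_vector<char>& a, lib::universal_vector<char>& b) {
    lib::swap(a, b);
}

// Unqualified call: ADL searches lib::detail, where the efficient overload lives.
void adl_swap(lib::universal_vector<char>& a, lib::universal_vector<char>& b) {
    swap(a, b);
}
```

In this sketch, qualified_swap increments only generic_swaps and copies both buffers, while adl_swap increments only fast_swaps and exchanges the buffers in constant time.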

How to Reproduce

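#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/swap.h>
#include <thrust/universal_vector.h>

#include <chrono>
#include <iostream>

template <template <typename> class Vector>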
void swap_time(const char* str) {
    Vector<char> a(1 << 30, 'a');
    Vector<char> b(1 << 30, 'b');

    auto begin = std::chrono::high_resolution_clock::now();
    thrust::swap(a, b);
    // a.swap(b);
    auto end = std::chrono::high_resolution_clock::now();
    std::cout << str << " swap time: " << std::chrono::duration<double>(end - begin).count() << " s" << std::endl;
    std::cout << "a: " << static_cast<char>(a[0]) << "; "
              << "b: " << static_cast<char>(b[0]) << std::endl;
}

int main() {
    swap_time<thrust::device_vector>("device");
    swap_time<thrust::host_vector>("host");
    swap_time<thrust::universal_vector>("universal");
    swap_time<thrust::universal_host_pinned_vector>("universal_host_pinned");
}

Expected behavior

  • Swap of universal host pinned vector is 30 million times faster
  • Swap of universal vector is 4 million times faster

Reproduction link

No response

Operating System

No response

nvidia-smi output

No response

NVCC version

No response

@gevtushenko gevtushenko added the bug Something isn't working right. label Nov 25, 2024
@github-project-automation github-project-automation bot moved this to Todo in CCCL Nov 25, 2024
@gevtushenko gevtushenko added the good first issue Good for newcomers. label Nov 25, 2024
@bernhardmgruber bernhardmgruber removed the good first issue Good for newcomers. label Nov 27, 2024
@bernhardmgruber
Contributor

bernhardmgruber commented Nov 28, 2024

While I agree that we should fix the performance issue, I have trouble with how we swap vectors. The established practice in C++ for swapping any two objects is:

using std::swap;
swap(a, b);

You make std::swap available and then ADL-call swap, so if a better (more specialized) implementation is available, it will be chosen. Alternatively, you can write using cuda::std::swap; instead, and using thrust::swap; works here as well. None of these versions shows the performance problem.

Only calling thrust::swap(a, b); directly is slow for universal vector, because the semantics are: "use the thrust swap algorithm" and not "find the best swap algorithm, while [std|cuda::std|thrust]::swap is available as fallback".
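To make the difference in semantics concrete, here is a small sketch of the two-step idiom in standard C++. The namespace app, the buffer type, the adl_swap_calls counter, and the swap_two_step helper are hypothetical names chosen for illustration only.

```cpp
#include <utility>
#include <vector>

// Hypothetical user namespace with a type that provides its own swap.
namespace app {

int adl_swap_calls = 0; // counts calls to the specialized overload

struct buffer {
    std::vector<int> data;
    void swap(buffer& other) noexcept { data.swap(other.data); }
};

// Specialized overload that ADL can find for app::buffer arguments.
void swap(buffer& a, buffer& b) noexcept {
    ++adl_swap_calls;
    a.swap(b);
}

} // namespace app

// Two-step swap: std::swap is the fallback, ADL may find something better.
template <typename T>
void swap_two_step(T& a, T& b) {
    using std::swap; // step 1: make the generic fallback visible
    swap(a, b);      // step 2: unqualified call, so ADL takes part in overload resolution
}
```

For int arguments, swap_two_step falls back to std::swap. For app::buffer arguments, overload resolution prefers the non-template app::swap, so the specialized path is taken without the caller naming it.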

@bernhardmgruber
Contributor

bernhardmgruber commented Nov 28, 2024

There are two ways to address this.

  1. Turn the aliases for thrust::universal_vector etc. into dedicated types and add specializations inside namespace thrust. All swap optimizations would only work for thrust::swap. All other swaps (from cuda::std or std) would do 3 moves.
  2. Make thrust::swap an alias to cuda::std::swap and add a bunch of fixes. All ways to swap would now take the optimal path. This requires some fixes to libcu++ as well, including this one: [BUG]: cuda::std::swap cannot ADL-two-step-swap a type from std #2984.

I strongly prefer 2.
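As a rough illustration of the idea behind option 2, the sketch below re-exports std::swap from a made-up namespace lib and counts what a qualified lib::swap call does to a type with cheap moves. This is only one reading of the proposal under stated assumptions: lib, tracked_vector, and the counters are hypothetical, std::swap stands in for cuda::std::swap, and the actual change in #2985 may differ.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical library namespace; std::swap stands in for cuda::std::swap.
namespace lib {

using std::swap; // lib::swap now names the standard, move-based swap

int copies = 0; // whole-buffer copies
int moves = 0;  // O(1) buffer moves

// Stand-in vector type that records how it is copied and moved.
struct tracked_vector {
    std::vector<char> storage;

    tracked_vector(std::size_t n, char v) : storage(n, v) {}
    tracked_vector(const tracked_vector& o) : storage(o.storage) { ++copies; }
    tracked_vector(tracked_vector&& o) noexcept : storage(std::move(o.storage)) { ++moves; }

    tracked_vector& operator=(const tracked_vector& o) {
        storage = o.storage;
        ++copies;
        return *this;
    }
    tracked_vector& operator=(tracked_vector&& o) noexcept {
        storage = std::move(o.storage);
        ++moves;
        return *this;
    }
};

} // namespace lib

// Qualified call through the alias: resolves to std::swap, which moves.
void swap_via_alias(lib::tracked_vector& a, lib::tracked_vector& b) {
    lib::swap(a, b);
}
```

In this sketch a qualified lib::swap call performs one move construction and two move assignments and never copies a buffer, so even the generic path stays cheap for types with constant-time moves.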

bernhardmgruber added a commit to bernhardmgruber/cccl that referenced this issue Nov 28, 2024
@bernhardmgruber bernhardmgruber linked a pull request Nov 28, 2024 that will close this issue
@cccl-authenticator-app cccl-authenticator-app bot moved this from Todo to In Review in CCCL Nov 28, 2024
@bernhardmgruber
Contributor

I implemented solution 2. in #2985, which is independent of #2984.

bernhardmgruber added a commit to bernhardmgruber/cccl that referenced this issue Nov 29, 2024