
Reduce memory usage in the sieve #45

Open
haampie opened this issue May 9, 2017 · 9 comments · May be fixed by #87

haampie commented May 9, 2017

Currently, memory usage of the prime sieve increases linearly with the size of the interval [lo, hi], while the number of primes grows only as hi / log(hi) asymptotically. Furthermore, the L1 cache is not exploited.

By implementing a segmented sieve both problems can be addressed at once: lowering memory consumption and increasing speed (in C++ I noticed 5x to 10x better performance).

The current algorithm is basically:

  1. Discover the prime numbers p less than √hi.
  2. Cross off the multiples of each such p, starting at p * p.
  3. Collect all remaining prime numbers in [√hi, hi].
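
As a concrete reference point, the three steps above can be sketched as a plain (non-segmented) sieve. The function name is illustrative, not the actual Primes.jl internals:

```julia
# A minimal (non-segmented) sieve over [2, hi], mirroring the three steps
# above: find small primes, cross off their multiples, collect survivors.
function simple_sieve(hi::Int)
    is_prime = trues(hi)          # one flag per number in [1, hi]: O(hi) memory
    is_prime[1] = false
    for p in 2:isqrt(hi)          # step 1: primes p ≤ √hi
        if is_prime[p]
            for m in p*p:p:hi     # step 2: cross off multiples, starting at p²
                is_prime[m] = false
            end
        end
    end
    return findall(is_prime)      # step 3: collect the surviving primes
end
```

Note the O(hi) allocation for `is_prime`: that is exactly the linear memory growth this issue is about.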

The suggested procedure is:

  1. Discover all prime numbers p less than √hi.
  2. Divide [√hi, hi] into fixed-size segments / smaller intervals (each fitting in L1 cache).
  3. Outer loop: iterate over the segments.
    • Inner loop: cross off all multiples of all sieving primes within the current segment.
    • Collect all primes in the current segment.

You only need to store one segment at a time (a constant-size allocation) in steps 2 and 3, and if the segment fits in L1 cache sieving is extremely fast, since you iterate over it many times.
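
A minimal sketch of the suggested procedure, under illustrative names and a hypothetical `segment_size` default — not the eventual Primes.jl implementation:

```julia
# Segmented sieve sketch: memory is O(segment_size) for the live segment
# plus the list of sieving primes up to √hi, instead of O(hi).
function segmented_primes(hi::Int; segment_size::Int = 32 * 1024)
    hi < 2 && return Int[]
    q = max(isqrt(hi), 2)
    # Step 1: sieving primes up to √hi via a plain sieve.
    small = trues(q)
    small[1] = false
    for p in 2:isqrt(q)
        if small[p]
            for m in p*p:p:q
                small[m] = false
            end
        end
    end
    sieving_primes = findall(small)
    result = copy(sieving_primes)
    # Steps 2-3: sieve (√hi, hi] one fixed-size segment at a time; only
    # `segment_size` flags are live at any moment, and each segment is
    # swept once per sieving prime while it sits in cache.
    seg = trues(segment_size)
    lo = q + 1
    while lo <= hi
        hi_seg = min(lo + segment_size - 1, hi)
        fill!(seg, true)
        for p in sieving_primes
            start = max(p * p, cld(lo, p) * p)  # first multiple of p ≥ lo
            for m in start:p:hi_seg
                seg[m - lo + 1] = false
            end
        end
        for i in 1:(hi_seg - lo + 1)
            seg[i] && push!(result, lo + i - 1)
        end
        lo = hi_seg + 1
    end
    return result
end
```

The output is independent of `segment_size`; the keyword only trades memory footprint against how well each segment fits in cache.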

For reference see http://primesieve.org/

ararslan (Member) commented May 9, 2017

Would you be willing to submit a PR for this? cc @pabloferz, who I think is quite familiar with the sieve code.

haampie (Author) commented May 9, 2017

I could definitely look into it, but I'll have to experiment with the (very cool) wheel idea first.

One issue: primesmask is also exported and returns a boolean array over the whole interval [lo, hi], so that function clearly cannot benefit from the memory reduction. It could still benefit from cache optimization, but right now I don't think it is trivial to share the same segmented sieve code between primes and primesmask.

Also, I think I should show some numbers, otherwise it's just speculation :p

@pabloferz (Member)

I wanted to implement a segmented sieve once. The only problem, AFAIK, is that there is no native way in Julia to find the L1 cache size on a given system. There are a couple of packages we could use for that, but I don't fancy the idea of depending on them (I'd like to keep Primes.jl lightweight on dependencies).

An option is to choose a fixed size that certainly fits in the L1 cache on the architectures Julia supports, but that would miss some performance on systems with a bigger L1 cache. And yeah, in this case primesmask and primes would probably need to rely on different implementations, though I believe there's a way of mixing the current approach with a segmented sieve (I haven't really thought it through).

If you want to submit a PR, I can certainly have a look.

@haampie
Copy link
Author

haampie commented May 9, 2017

An optimistic (over)estimate of the cache size would still reduce cache misses relative to the current implementation, but indeed it is messy to have such a constant around. It would be interesting to see a cache-oblivious algorithm, but I doubt one exists.

One idea is to let the user decide the segment size and provide a reasonable default. Maybe a signature like

    primes(lo, hi; segment_size = 32 * 1024)

@oscardssmith (Member)

Is there a way to see if a package is installed, and use it if it exists?

@ararslan (Member)

@oscardssmith No

@pabloferz (Member)

@oscardssmith just for reference, that kind of question is better suited for a site like https://discourse.julialang.org/

haampie (Author) commented Dec 18, 2019

Since this is the first issue I have ever opened related to Julia, I feel like I should close it at some point as well.

At the moment I have implemented an absolute monstrosity that unrolls the inner loop of the {2,3,5}-wheel sieve with @goto statements so that it removes 8 multiples of a prime per iteration. It's roughly 9x faster than the current Primes._primesmask function for n around 500_000. I could probably generate this code with a macro.
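
For context on why the unrolling factor is 8: the {2,3,5} wheel has exactly 8 residue classes mod 30 that are coprime to 2, 3, and 5, so one "turn" of the wheel steps through 8 candidate multiples. Those residues and the gaps between them can be computed directly:

```julia
# Residues mod 30 not divisible by 2, 3, or 5: the 8 spokes of the
# {2,3,5} wheel. An unrolled inner loop handles one gap per spoke,
# i.e. 8 multiples of a prime per full rotation of the wheel.
wheel = [r for r in 1:30 if gcd(r, 30) == 1]
gaps  = diff(vcat(wheel, wheel[1] + 30))  # distances between consecutive spokes
```

Sieving only these residue classes skips 22 of every 30 numbers up front, which is what makes the wheel worthwhile before any cache tricks.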

The next step is to reduce the memory from O(n) to O(√n / log n) by storing just an O(1) chunk of memory that fits in L1 cache, plus an O(√n / log n) list of sieving primes up to √n. I'm hoping that will also be 5x to 10x faster for large n...

Lastly with Julia 1.3 it should not be hard to make it multithreaded, right?

haampie (Author) commented Dec 22, 2019

I've now implemented a cache-friendly version here: https://github.com/haampie/FastPrimeSieve.jl. It generates the unrolled sieving loop with macros, which is kinda neat.

I'm getting a 12x speedup 🚀 for the prime-counting function for n = 1_000_000_000.

What remains is to see how to contribute it back to this package, and to consider multithreading.

@haampie haampie linked a pull request Sep 7, 2020 that will close this issue