This repository has been archived by the owner on May 5, 2024. It is now read-only.

Numba/Numpy version #4

Closed
ktritz opened this issue Feb 28, 2017 · 32 comments

Comments

@ktritz

ktritz commented Feb 28, 2017

I've implemented a fairly direct port/optimization using Numba for Python. This runs ~10x faster for single-point noise calculations, and 50-100x faster for calculating noise arrays. Obviously, this requires installing the Numba library, but people are free to check it out if they're interested.

@lmas
Owner

lmas commented Feb 13, 2018

I'll leave this open for others to take note of, if there's a need for quick optimization.

I don't like introducing a 3rd-party dependency (besides the test utils), what with all the leftpad nonsense and friends going around the developer communities, though, so I'll leave it at that.

@ktritz
Author

ktritz commented Feb 13, 2018

Sounds good, I didn't expect it to enter the main branch. Thanks.

@ktritz
Author

ktritz commented Feb 14, 2018 via email

@GAZ082

GAZ082 commented Feb 22, 2018

@ktritz dude, your implementation has eye-watering speeds. Truly impressive. You should make a separate package or something, so it can be easily installed with pip.

Kudos.

@mfeif

mfeif commented Sep 25, 2019

Thanks for this library, both @lmas and @ktritz!

@ktritz would you be able to put a quick bit in the readme on your fork that shows how to engage the speed-ups for those who are willing to use numba? The API doesn't seem to have changed, so it's hard to tell. I don't notice any speedups going from one fork to another (but I admit I haven't timed it). Should I see a clear speed-up even for trivial operations like making an image grid, or do I need to engage numba in a meaningful way when I send in x, y?

Thanks

@endolith

I was also going to ask why this wasn't numpy-based, but I see it's been proposed multiple times. Can you add something about this in the documentation and link to the forks?

@lmas
Owner

lmas commented Mar 30, 2021

Hmm, been thinking that I might maintain a numpy-based branch too, so it could be released on PyPI.

@endolith

So the numpy version would take a 2D array of coordinates as input?

@lmas
Owner

lmas commented Apr 1, 2021

Haven't worked with numpy before. Could you describe your use case more?

@endolith

endolith commented Apr 2, 2021

Instead of

    for y in range(0, HEIGHT):
        for x in range(0, WIDTH):
            value = simplex.noise2d(x / FEATURE_SIZE, y / FEATURE_SIZE)

you would use something like:

    x, y = np.mgrid[0:WIDTH, 0:HEIGHT]
    values = simplex.noise2d(x / FEATURE_SIZE, y / FEATURE_SIZE)

and it would calculate all the values simultaneously, in much less time.
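As a stopgap sketch of that array-in/array-out shape (noise2d_scalar here is a hypothetical stand-in, since this library's simplex.noise2d is scalar-only), np.vectorize gives the same API without any real speedup; the loops just move inside numpy:

```python
import numpy as np

# Hypothetical stand-in for a scalar noise function; the real call
# would be simplex.noise2d(x, y) from this library.
def noise2d_scalar(x, y):
    return (x + y) % 1.0

FEATURE_SIZE, WIDTH, HEIGHT = 24.0, 4, 3
x, y = np.mgrid[0:WIDTH, 0:HEIGHT]

# np.vectorize only hides the explicit Python loops -- it does not
# speed anything up -- but it demonstrates the proposed array API.
noise2d_grid = np.vectorize(noise2d_scalar)
values = noise2d_grid(x / FEATURE_SIZE, y / FEATURE_SIZE)
```

A truly vectorized implementation would instead compute all grid points with array arithmetic internally, which is where the speedup comes from.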

@lmas
Owner

lmas commented Apr 2, 2021

Ah yeah, kinda obvious usage now that you've explained it, thanks. And it's kinda an obvious must-have feature too..
Hmm, starting to think numpy is pretty much a required dependency I can't live without any more.

Alright, so I'm heavily considering adding numpy/numba optimizations to the main branch (forget multiple pkgs).
Pros are obvious (performance, no change to the old API, seems easy to incorporate), but are there any cons? Is numpy too big and heavy for small scripts? Not suited for game dev? (I assume numpy is commonly used in CGI or research work and it's already installed in those environments.)

I would like more input from more users if possible, please.

@lmas lmas changed the title Numba version Numba/Numpy version Apr 2, 2021
@endolith

endolith commented Apr 2, 2021

In my experience, numpy is a much more common dependency than numba. numba is more likely to have problems with installation on different platforms, etc., and shouldn't be necessary if you can vectorize the code using numpy. numba is for for-loops that can't be vectorized.

@ktritz
Author

ktritz commented Apr 3, 2021 via email

@lmas
Owner

lmas commented Apr 3, 2021

Thank you. If the gains are big enough I might include numba too then. Maybe.

@ktritz
Author

ktritz commented Apr 3, 2021 via email

@prdx23

prdx23 commented Apr 27, 2021

If you are implementing an optimized numpy/numba version, it might be useful to declare them as optional dependencies in setup.py using extras_require. This will make it so that users who want the optimized version can install it with pip install opensimplex[numpy], and this can be documented in the readme. It will also make sure you mark an exact version of numpy as a dependency, which can then be used to test against in the tests if needed.

In the package itself, you can detect if numpy is present with find_spec from importlib, as mentioned here, and then switch which implementation is active.
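A sketch of that idea (the version pin and helper name are illustrative, not from the actual package):

```python
from importlib.util import find_spec

# In setup.py one would declare something like (sketch only):
#   extras_require={"numpy": ["numpy==1.20.*"]}
# so that users can run: pip install opensimplex[numpy]

def pick_backend():
    """Choose the noise implementation based on what's installed."""
    if find_spec("numpy") is not None:
        return "numpy"          # array-optimized implementation
    return "pure-python"        # plain fallback implementation

backend = pick_backend()
```

find_spec checks for the package without importing it, so the fallback path never pays the import cost.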

@lmas
Owner

lmas commented May 4, 2021

Hmm, I don't know, I'm not a fan of complicating the installation or the code for now. It feels like numpy is common enough that I could add it as a hard, simple dependency instead. I'll reconsider in the future if people have problems or suddenly stronger opinions.

Pinning versions/vendoring is of course always a good thing 👍

@lmas lmas mentioned this issue May 9, 2021
@rometsch

rometsch commented May 9, 2021

Thanks for the package, @lmas!

I just finished optimizing my own fork using numba before reading the issues, and arrived at something very similar to the versions of @ktritz.
A good exercise in using numba :)

My impression is also that almost anyone who needs a 2d noise function should have numpy installed already.

To give an example of @ktritz's suggestion with the try import/except:

try:
    from numba import jit
except ImportError:
    def jit(*args, **kwargs):
        def wrapper(func):
            return func
        return wrapper

This way, anyone with numba installed can benefit from a speedup (which could be stated somewhere in the documentation), and for everyone else the jit decorator will do nothing.
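As a usage sketch of that fallback (dot2 is an arbitrary stand-in for one of the library's numeric kernels, not real code from it):

```python
try:
    from numba import jit
except ImportError:
    # No-op replacement: the decorator returns the function untouched.
    def jit(*args, **kwargs):
        def wrapper(func):
            return func
        return wrapper

@jit(nopython=True, cache=True)
def dot2(ax, ay, bx, by):
    # Stand-in for a small numeric kernel; runs compiled with numba,
    # or as plain Python without it.
    return ax * bx + ay * by

result = dot2(1.0, 2.0, 3.0, 4.0)
```

The decorated function behaves identically either way; only the speed differs.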

@lmas
Owner

lmas commented May 10, 2021

Thank you and thanks for the suggestion.

Which functions benefit from the numba decorator? And did you run the make bench benchmarking before/after?
I'm just curious and want more benchmarks for it besides my own (eventually).

@rometsch

rometsch commented May 10, 2021

Functions benefiting from / viable for just-in-time compilation

The decorated functions are the extrapolate functions, the noise functions themselves, and some helper functions (noise2da) that iterate over the input arrays (these are flattened by calling .ravel() on them, which I think doesn't copy the data...).

@njit(cache=True)
def extrapolate2d(xsb, ysb, dx, dy, perm):
    index = perm[(perm[xsb & 0xFF] + ysb) & 0xFF] & 0x0E

    g1 = GRADIENTS_2D[index]
    g2 = GRADIENTS_2D[index+1]
    return g1 * dx + g2 * dy

@njit(cache=True)
def noise2da(a, x, y, perm):
    for n in range(len(x)):
        a[n] = noise2d(x[n], y[n], perm)

@njit(cache=True)
def noise2d(x, y, perm):
    """
    Generate 2D OpenSimplex noise from X,Y coordinates.
    """
    # Place input coordinates onto grid.
    stretch_offset = (x + y) * STRETCH_CONSTANT_2D
    xs = x + stretch_offset
    ys = y + stretch_offset
    ...

njit is the same as jit(nopython=True) which forces numba to make sure every function that's called is compiled too.

These functions were moved outside of the class, and the perm array is passed around.
This was easier than decorating the class.

Essentially all the functions that do calculations with basic data types like any version of int and float are just in time compilable.

Numba has to figure out the type of the parameters and variables in order to compile them.
The compiled version is statically typed.
One can also specify the type by hand, but this is not needed here.

Up until now, I was not able to get any functions with dicts or classes running.
But for 'simpler' functions, it's relatively easy.
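Specifying the type by hand, as mentioned above, would look like this (a sketch with a no-numba fallback so it stays runnable anywhere; "f8(f8, f8)" means float64 in and out, and an explicit signature makes numba compile eagerly at import time rather than on the first call):

```python
try:
    from numba import njit
except ImportError:
    def njit(*args, **kwargs):   # no-op fallback when numba is absent
        def wrapper(func):
            return func
        return wrapper

# Explicit signature: returns float64, takes two float64 arguments.
@njit("f8(f8, f8)", cache=True)
def lerp_mid(a, b):
    # Placeholder kernel, not part of the actual library.
    return 0.5 * (a + b)
```

With only @njit(cache=True) and no signature, numba would instead infer the types from the arguments of the first call.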

My benchmark results

I extended the benchmark to 1 000 000 function calls.

For the current version:

for i in range(1000000):
    oldsimplex.noise2d(0.1, 0.1)
    oldsimplex.noise3d(0.1, 0.1, 0.1)
    oldsimplex.noise4d(0.1, 0.1, 0.1, 0.1)

         24000004 function calls in 20.616 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   20.616   20.616 {built-in method builtins.exec}
        1    0.000    0.000   20.616   20.616 <string>:1(<module>)
        1    1.222    1.222   20.616   20.616 tests/benchmark_oldopensimplex.py:22(run)
  1000000    5.661    0.000    9.095    0.000 /home/rometsch/repo/opensimplex/tests/old_opensimplex.py:735(noise4d)
  1000000    4.053    0.000    6.352    0.000 /home/rometsch/repo/opensimplex/tests/old_opensimplex.py:240(noise3d)
  1000000    2.507    0.000    3.947    0.000 /home/rometsch/repo/opensimplex/tests/old_opensimplex.py:137(noise2d)
  5000000    3.127    0.000    3.127    0.000 /home/rometsch/repo/opensimplex/tests/old_opensimplex.py:126(_extrapolate4d)
  4000000    2.067    0.000    2.067    0.000 /home/rometsch/repo/opensimplex/tests/old_opensimplex.py:117(_extrapolate3d)
  3000000    1.282    0.000    1.282    0.000 /home/rometsch/repo/opensimplex/tests/old_opensimplex.py:110(_extrapolate2d)
  9000000    0.697    0.000    0.697    0.000 {built-in method math.floor}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

For my modified version:

Here, I used a linearly spaced 1d array between 0 and 1 for all coordinates.

x = np.linspace(0, 1, number)
simplex.noise2d(x, x)
simplex.noise3d(x, x, x)
simplex.noise4d(x, x, x, x)

The numpy array is iterated over inside a function optimized by numba.

@njit(cache=True)
def noise2da(a, x, y, perm):
    for n in range(len(x)):
        a[n] = noise2d(x[n], y[n], perm)

         93 function calls (90 primitive calls) in 0.280 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.280    0.280 {built-in method builtins.exec}
        1    0.000    0.000    0.280    0.280 <string>:1(<module>)
        1    0.000    0.000    0.280    0.280 tests/benchmark_opensimplex.py:16(run)
        1    0.000    0.000    0.173    0.173 /home/rometsch/repo/opensimplex/tests/src_opensimplex.py:142(noise4d)
        1    0.173    0.173    0.173    0.173 /home/rometsch/repo/opensimplex/tests/src_opensimplex.py:171(noise4da)
        1    0.000    0.000    0.066    0.066 /home/rometsch/repo/opensimplex/tests/src_opensimplex.py:127(noise3d)
        1    0.066    0.066    0.066    0.066 /home/rometsch/repo/opensimplex/tests/src_opensimplex.py:165(noise3da)
        1    0.000    0.000    0.035    0.035 /home/rometsch/repo/opensimplex/tests/src_opensimplex.py:114(noise2d)
        1    0.035    0.035    0.035    0.035 /home/rometsch/repo/opensimplex/tests/src_opensimplex.py:159(noise2da)
        1    0.000    0.000    0.006    0.006 <__array_function__ internals>:2(linspace)
      4/1    0.000    0.000    0.006    0.006 {built-in method numpy.core._multiarray_umath.implement_array_function}
        1    0.001    0.001    0.006    0.006 /home/rometsch/.local/lib/python3.8/site-packages/numpy/core/function_base.py:26(linspace)
        1    0.005    0.005    0.005    0.005 {built-in method numpy.arange}

This comparison, run on an Intel Core i7-5820K, yields a speedup of about 74x.

@lmas
Owner

lmas commented May 10, 2021

Hmm that's quite impressive but ncalls = 1 feels weird? Did you also verify you're getting correct output?

@rometsch

Yes, I also ran tests/test_opensimplex.py which says everything ok.

I think the ncalls = 1 is because it runs noise2da once, which then iterates over the array inside compiled code, so the inner calls are not visible to the python profiler.

@rometsch

Also, I just finished a numpy-based port of the noise2d function.
See the branch on my fork here.

I only did the 2d function, because all the if/else clauses make it really messy!

There is also a big performance hit when using numpy array slicing instead of the ifs.

For a 1000 x 1000 grid, the runtimes (with %timeit in Jupyter) are as follows:

numba optimized : 35.4 ms ± 505 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
numpy : 327 ms ± 2.42 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
numpy optimized with njit: 278 ms ± 6.79 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Looks like the algorithm is just not well suited for optimization with numpy array slicing.
There is, of course, also the possibility that my attempt is lacking!

@lmas
Owner

lmas commented May 10, 2021

Yes, I also ran tests/test_opensimplex.py which says everything ok.

Ah ok thanks, sounds good then 👍 And yeah the overall structure can be a lil' messy as the old upstream source was in java and my port wasn't very pythonic from the beginning I think heh.

Also a little surprised there was that much of a difference between the versions; I was under the impression that numpy was the better one, as it works with arrays in a leaner way.

But I'm getting carried away and should really stop and warn you before you waste too much time on it! I'm busy porting the new opensimplex2 and will probably push it this week (which will deprecate v0.3)

@rometsch

Thanks for the warning :) Looking forward to the new version.

The optimization issue is quite interesting for me to explore as well.

The numpy implementation was just to see whether it's a viable route to go; it seems that the array perspective is not the ideal one here.

And yeah the overall structure can be a lil' messy as the old upstream source was in java and my port wasn't very pythonic from the beginning I think heh.

I don't think it's your implementation or port.
The algorithm just requires checking multiple conditions, and it's simply crunching numbers with no way around it.

Just a little more speculation from my side about my implementation with numpy arrays:

The biggest drawback is that I have to explore the whole possibility tree with the whole array, meaning I calculate whether or not to do a calculation inside an if/else statement for all entries, and then only do the calculation for a subset.
I suppose the bottleneck is memory latency, because the array is iterated over and over again (for the checks of whether a condition applies), which suffers from cache misses all the time.
In contrast, for serial execution of the function, we can expect that most of the variables can be held in cache, and if/else is very inexpensive anyway.

I didn't test this though, so take it with a grain of salt ;)
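An illustrative example of that "whole possibility tree" cost (not the library's code): numpy's vectorized branch selection with np.where evaluates both branch arrays in full before picking per-element results, whereas a scalar if/else computes exactly one branch per value:

```python
import numpy as np

x = np.linspace(-1.0, 1.0, 9)

# Both branch expressions below are computed for EVERY element first;
# np.where only selects afterwards. With scalar code, each value would
# run through just one of the two branches.
y = np.where(x > 0, np.sqrt(np.abs(x)), x * x)
```

For the noise algorithm, with its many nested branches, this means a lot of wasted work on array entries whose results are discarded.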

@lmas
Owner

lmas commented May 11, 2021

I think you're pretty spot on and matches upstream's thoughts.
As I mentioned in that issue, I too naively thought that a lookup table would perform way better (I somehow thought the array would be cached efficiently and the lookups would have a smaller impact on performance, compared to running the same if/else and branch predictions and whatever else over and over all the time).

Starting to appreciate numba's caching more now 😁
Edit: Hmm, on the other hand, numba seems to be severely limited by type inference and poor support for classes. Really hard to work with...

@rometsch

I just had a quick try at optimizing the new version on the dev branch; apart from having to replace the classes for points and gradients with arrays (numba does not support arrays of objects...), it works.

When you are finished, I'll have a look at your port to see if I can optimize it.

The new version looks much cleaner. It gives many more hints at what's going on :)

@lmas
Owner

lmas commented May 12, 2021

Yeah, it was pretty nice and clean to work with. In that linked issue the author talks about refactoring away the array lookups for performance, so it looks like it's back to long rows of if/else branches, which hopefully increases performance again, as my port is about twice as slow as v0.3 (ouch).
I'll also have to study your solutions for numba; it was frustrating with all the custom classes and objects heh

Kinda holding off on any more work until Simplex2S has been updated, Maybe Soon ™️

@MightyBOBcnc

MightyBOBcnc commented Jul 23, 2021

I would like more input from more users if possible, please.

A bit late to the party but I would very much appreciate further acceleration of performance. :-)

numpy is pretty mature/common now. (As an example it's one of the "standard" libraries that ships with Blender. Blender has a python API for developing add-ons and numpy is a boon for operations on dense/heavy meshes.)

Although numba might be a bit more of a pain to install from scratch it's just as easy to get from Anaconda as numpy is.

As an end user my use case is:
I'm creating a heavily subdivided icosahedron 'icosasphere' with meshzoo and then sampling the xyz of each vertex with your opensimplex 0.3 library to generate height offsets.

def sampleNoise(verts, noise_seed=0):
    """Sample a simplex noise for given vertices"""
    scale_factor = 0.05
    roughness = 12
    tmp = OpenSimplex(seed=noise_seed)
    elevations = np.zeros((len(verts), 1))
    for v, vert in enumerate(verts):
        x, y, z = vert[0] * roughness, vert[1] * roughness, vert[2] * roughness
        elevations[v] = tmp.noise3d(x, y, z) * scale_factor + 1
    new_verts = verts * elevations
    return new_verts, elevations

k = 500
seed = 1337
verts, tris = mz.icosa_sphere(k)
noisy_verts, elevations = sampleNoise(verts, seed)
# Do stuff with noisy_verts and elevations

For the simplest possible icosahedron k=1 (12 vertices, 20 triangles) the array of verts to sample looks like:

[[-0.52573111  0.85065081  0.        ]
 [ 0.52573111  0.85065081  0.        ]
 [-0.52573111 -0.85065081  0.        ]
 [ 0.52573111 -0.85065081  0.        ]
 [ 0.         -0.52573111  0.85065081]
 [ 0.          0.52573111  0.85065081]
 [ 0.         -0.52573111 -0.85065081]
 [ 0.          0.52573111 -0.85065081]
 [ 0.85065081  0.         -0.52573111]
 [ 0.85065081  0.          0.52573111]
 [-0.85065081  0.         -0.52573111]
 [-0.85065081  0.          0.52573111]]

And with a higher subdivision the result is like so after sampling and applying the noise:

[image: the icosasphere after sampling and applying the noise]

For 100,002 vertices (k=100) this isn't too bad; it only takes ~2.5 seconds which is a tolerable amount of time to wait, even if it's 2.5*N for N number of noise octaves.
But beyond that it won't scale very well. 100k vertices isn't that many; I need millions.
For 2.5 million vertices (k=500) the time is ~64 seconds per octave.
8.1 million vertices (k=900) is ~204 seconds per octave.
62.5 million vertices (k=2500) takes ~26 minutes per octave which is an eternity for rapid iteration.

I also tried some other variations of the function, like the one below, but performance was the same for all of them, within a few seconds of each other. When it's already taking three and a half minutes to run, 204 vs 207 seconds isn't much to crow about.

def sampleNoise2(verts, noise_seed=0):
    """Sample a simplex noise for given vertices"""
    scale_factor = 0.05
    roughness = 12
    tmp = OpenSimplex(seed=noise_seed)
    roughed_verts = verts * roughness
    elevations = np.array([[tmp.noise3d(v[0],v[1],v[2]) * scale_factor + 1] for v in roughed_verts])
    new_verts = verts * elevations
    return new_verts, elevations

However, using ktritz's original numba fork as a drop-in replacement (without changing a single line of my own code for additional optimization):
8.1 million vertices went from ~204 seconds to ~13.8 seconds
62.5 million vertices went from ~26 minutes to ~106 seconds

Still slower than I'd like, but a drastic improvement (although the fork is based on 2.2 I believe).

Now the variations of the sampling function start to matter. Best one so far is almost the same as the first but got rid of the separate x, y, z and saved 6 seconds (19 vs 13 seconds for 8.1 mil vertices):

def sampleNoise4(verts, noise_seed=0):
    """Sample a simplex noise for given vertices"""
    scale_factor = 0.05
    roughness = 12
    tmp = OpenSimplex(seed=noise_seed)
    elevations = np.zeros((len(verts), 1))
    rough_verts = verts * roughness
    for v, vert in enumerate(rough_verts):
        elevations[v] = tmp.noise3d(vert[0],vert[1],vert[2]) * scale_factor + 1
    new_verts = verts * elevations
    return new_verts, elevations

@lmas
Owner

lmas commented Aug 18, 2021

Sorry for the late reply, moving across the country and prepping for university..

Thank you a lot for your detailed input! And a very cool project, love seeing 3D results! May I ask what's the purpose of generating this detailed icosahedron?

And yeah, I hear you, I need to get this optimisation done sooner! There have been no updates to Simplex2S so far, so I'm thinking I'll just update the current version, as it's still performing better than the new dev version.

Thank you for another use case!

@MightyBOBcnc

May I ask what's the purpose of generating this detailed icosahedron?

I'm building a procedural planet generator. Planets--as we know--are large, so I need a lot of points if I want to represent a reasonable amount of detail from a certain distance. I chose an icosahedron as the distribution of points is more uniform than a UV sphere or a sphereized cube.

Since I wrote my previous post I've made some further modifications. I got rid of the OpenSimplex class entirely from ktritz's fork (jit-decorating a class is a bit more involved than I can handle at the moment) and I'm calling the needed functions directly (lines 121 and 297 in his fork). It's slightly less convenient, but it offered another performance boost on top of the one I mentioned previously, because now I can jit-decorate my noise sampling function, where previously I couldn't because numba didn't know how to handle the OpenSimplex object.

import numpy as np
from numba import njit, prange
import opensimplex as osi
import meshzoo as mz

def main():
    k = 500
    world_seed = 1337
    verts, tris = mz.icosa_sphere(k)

    # Initialize the permutation arrays to be used in noise generation
    perm, pgi = osi.init(world_seed)

    elevations = sample_noise(verts, perm, pgi, 1.6, 0.4)

    # Do stuff with elevations

@njit(cache=True, parallel=False)
def sample_noise(verts, perm, pgi, n_roughness=1, n_strength=0.2):
    """Sample a simplex noise for given vertices"""
    elevations = np.ones(len(verts))
    rough_verts = verts * n_roughness

    # Block 1
    # Comment me out and try Block 2 instead
    for v, vert in enumerate(rough_verts):
        elevations[v] = osi.noise3d(vert[0], vert[1], vert[2], perm, pgi)

    # Block 2
    # Uncomment me and set parallel=True in the njit decorator
    # for v in prange(len(rough_verts)):
    #     elevations[v] = osi.noise3d(rough_verts[v][0], rough_verts[v][1], rough_verts[v][2], perm, pgi)

    return np.reshape((elevations + 1) * 0.5 * n_strength, (len(verts), 1))

To show the further improvements to performance you can see comments for two code blocks in sample_noise. Here's the improvement with Block 1:

8.1 million vertices went from ~204 seconds to ~13.8 seconds ----> It's ~0.44 seconds now
62.5 million vertices went from ~26 minutes to ~106 seconds -----> It's ~3.3 seconds now, wow!

Block 1 is single-threaded performance. If we use Block 2 instead of 1, and set parallel=True to take advantage of numba's prange threading then...

8.1 million vertices went from ~204 seconds to ~13.8 seconds to ~0.44 seconds ----> It's ~0.08 seconds now
62.5 million vertices went from ~26 minutes to ~106 seconds to ~3.3 seconds ----> It's ~0.66 seconds now, holy cow!!

26 minutes to 0.66 seconds. Bananas! On my Ryzen 3900x my CPU can barely touch 100% utilization before it's over (you can clearly see the spike in all cores when prange kicks in). I'd have to throw more than 60 million vertices at it to get my CPU to 100% for any appreciable amount of time.
[image: CPU utilization spiking across all cores when prange kicks in]
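The prange pattern described above can be sketched minimally like so (the kernel is a placeholder rather than the real noise function, and a no-numba fallback keeps it runnable anywhere):

```python
import numpy as np

try:
    from numba import njit, prange
except ImportError:
    def njit(*args, **kwargs):    # no-op fallback without numba
        def wrapper(func):
            return func
        return wrapper
    prange = range                # serial fallback for prange

@njit(parallel=True, cache=True)
def scale_all(values, factor):
    out = np.empty_like(values)
    # Under numba with parallel=True, prange splits these iterations
    # across CPU cores; without numba it is an ordinary range().
    for i in prange(len(values)):
        out[i] = values[i] * factor
    return out

scaled = scale_all(np.arange(4.0), 2.0)
```

Each iteration must be independent of the others for prange to be safe, which holds here (and for per-vertex noise sampling) since every output element depends only on its own input.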

@lmas
Owner

lmas commented Oct 5, 2021

Thank you a lot, I appreciate your detailed replies and stats! Using parallel=True and prange were new tricks for me,
so thanks for that too.

Now then, I've finally gotten around to investigating most of the other linked forks from people in this ticket and how they've
attempted to optimize the lib. In the end I decided it was easier to implement the changes myself, so it would fit in
with the initial overhaul. I also reverted opensimplex v2 as my attempts failed to match the performance from v1.

New changes have been pushed to dev that add numpy as a hard requirement and numba as an option, but I haven't managed
to properly benchmark it yet (no proper dev rig atm). Before finishing the update I would like to, preferably, have it
more tested/benchmarked by someone else.
Pretty please?

Notice: API names have been changed
