Add criterion based benchmarks #356

Open · wants to merge 16 commits into main

Conversation

@jonboh (Contributor) commented May 13, 2023

Hi! Following up on #10.

I've transformed the existing examples into benchmarks. I've kept the original examples, since benchmarks should be more stable than examples to allow comparing performance over time: the two may diverge, and changes to the examples should not affect the benchmark metrics.
I've, for the most part, kept the optimization problem parameters as they were in the examples.

Most of the benchmarks don't do much more than run the (rewritten) example. However, in the case of ParticleSwarm, LBFGS, and BFGS I've expanded the example to run the optimization with the different backends, and in the case of LBFGS and BFGS also with multiple dimensions. Hopefully this can serve as a starting point for refining these benchmarks further. I wanted to get some feedback before going any further.
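For illustration, the dimension-parametrized benchmarks are shaped roughly like this (a sketch only; `run_lbfgs_vec` is a placeholder for the helper that builds and runs the solver on the Vec backend):

```rust
use criterion::{criterion_group, criterion_main, BenchmarkId, Criterion};

fn bench_lbfgs(c: &mut Criterion) {
    let mut group = c.benchmark_group("lbfgs");
    // One benchmark per dimension; the same pattern is repeated per backend.
    for dim in [2usize, 4, 8, 16] {
        group.bench_with_input(BenchmarkId::new("vec", dim), &dim, |b, &dim| {
            // `run_lbfgs_vec` is a hypothetical helper; panic on error so a
            // failing solver cannot report an unrealistically small time.
            b.iter(|| run_lbfgs_vec(dim).expect("solver failed"));
        });
    }
    group.finish();
}

criterion_group!(benches, bench_lbfgs);
criterion_main!(benches);
```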

There are two issues that I wanted to discuss (maybe I'm doing something wrong with nalgebra and ndarray):

  1. In the BFGS example, I've written a Vec-based problem that fails; however, the ndarray version with the same parameters reaches a solution. I haven't gone any further, but I would have assumed that both versions should yield the same successful result.
  2. (The important one.) In all my results I've found the ndarray and nalgebra backends to be slower than the Vec one. I've run the benchmarks on an AMD FX-8350 and an Intel i7-12800HX, and the performance difference persists in both cases.
    This is easily seen in the ParticleSwarm benchmark, which runs on all three backends; it can also be seen in the LBFGS one (although that one lacks the nalgebra backend).

ParticleSwarm:
[benchmark results plot]

LBFGS (the input axis represents the number of dimensions):
[benchmark results plots]

To run all benchmarks do:

cargo bench --features=_full_dev

To run just a benchmark file:

cargo bench --features=_full_dev <benchmark_file>
# example
cargo bench --features=_full_dev particleswarm

To run just a benchmark in a group:

cargo bench --features=_full_dev <benchmark_file> -- <benchmark_pattern>
# example, this will run just lbfgs/ndarray/4
cargo bench --features=_full_dev lbfgs -- lbfgs/ndarray/4
# example, this will run all the lbfgs/ndarray benchmarks
cargo bench --features=_full_dev lbfgs -- lbfgs/ndarray

The generated report: target/criterion/report/index.html

I think the benchmarks are already useful as they are right now. However, it might be worthwhile to define a common problem for all of them, one that doesn't produce convergence issues, to avoid code duplication (for example, the Rosenbrock problem is defined almost identically in at least 4 benchmarks); a sketch follows this paragraph.
In the case of an error I've forced the benchmark to panic, to avoid a failing solver reporting an unrealistically small execution time.
Another thing that might be worth benchmarking is the impact of the loggers on the optimization performance. I might add something in that respect after going through your feedback on this.
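A rough sketch of what such a shared problem could look like, assuming the argmin 0.8 `CostFunction`/`Gradient` traits (the exact signatures may differ between versions):

```rust
use argmin::core::{CostFunction, Error, Gradient};

/// n-dimensional Rosenbrock problem, defined once and reused by the benchmarks.
struct Rosenbrock;

impl CostFunction for Rosenbrock {
    type Param = Vec<f64>;
    type Output = f64;

    fn cost(&self, p: &Self::Param) -> Result<Self::Output, Error> {
        Ok(p.windows(2)
            .map(|w| 100.0 * (w[1] - w[0].powi(2)).powi(2) + (1.0 - w[0]).powi(2))
            .sum())
    }
}

impl Gradient for Rosenbrock {
    type Param = Vec<f64>;
    type Gradient = Vec<f64>;

    fn gradient(&self, p: &Self::Param) -> Result<Self::Gradient, Error> {
        let n = p.len();
        let mut g = vec![0.0; n];
        for i in 0..n {
            if i + 1 < n {
                // Contribution of the term involving (x[i], x[i+1]).
                g[i] += -400.0 * p[i] * (p[i + 1] - p[i].powi(2)) - 2.0 * (1.0 - p[i]);
            }
            if i > 0 {
                // Contribution of the term involving (x[i-1], x[i]).
                g[i] += 200.0 * (p[i] - p[i - 1].powi(2));
            }
        }
        Ok(g)
    }
}
```

The ndarray/nalgebra variants would only differ in the `Param`/`Gradient` types, so the scalar formulas above could be shared.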

I've also modified the bench profile to include debug symbols. This isn't strictly necessary, but it produces the information needed to generate flamegraphs from these benchmarks (change in the top-level Cargo.toml; see the snippet below).
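For reference, the profile change is roughly the following:

```toml
# top-level Cargo.toml
[profile.bench]
# Keep debug symbols so flamegraphs generated from the benchmarks are readable.
debug = true
```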

Let me know what you think about the benchmark approach and the issues; any feedback is welcome :)

jonboh added 16 commits May 11, 2023 17:43
Vec, nalgebra, ndarray
this benchmark has been moved to particleswarm in order to group
benchmark the backends
which is already included in lbfgs bench
I'm not sure the Vec benchmark is doing any work; it runs in ~500 nanoseconds vs
ndarray, which runs in ~100 microseconds
This was the reason why the Vec version of LBFGS was running in 500 ns
(it just aborted)
this eases investigations using flamegraph
@stefan-k (Member) commented

Hi @jonboh, I am very sorry for the very very late reply. I have been rather busy over the last couple of months and currently only have limited access to the Internet (#360). This is an excellent addition and I'm highly thankful for all the work you have put into this. Unfortunately I don't have time right now to give this PR the attention it deserves. I do hope that I will be able to respond adequately and give a detailed review soon.

Thanks a lot!!

@jonboh (Contributor, Author) commented Jun 14, 2023

Nothing to worry about; whenever you are ready we can continue with this PR.

@stefan-k (Member) commented Nov 5, 2023

Hi @jonboh ,

thanks again for the amazing work! :)

> I've transformed the existing examples into benchmarks. I've kept the original examples, as benchmarks should be more stable than examples to allow comparing performance across time, so they might diverge from the current examples and changes in the examples should not modify the benchmark metrics. I've, for the most part, kept the optimization problem parameters as they were in the examples.

> Most of the benchmarks don't do much more than running the (rewritten) example. However in the case of ParticleSwarm, LBFGS and BFGS I've expanded the example to run the optimization with the different backends, and in the case of LBFGS and BFGS to run with multiple dimensions. Hopefully this can serve as a starting point to refine these benchmarks further.

This sounds like a good approach!

> 1. In the BFGS example, I've written a Vec based problem that fails, however the ndarray version with the same parameters gets to a solution. I haven't gone any further, but I would have assumed that both versions should yield the same successful result.

Ah yes, the old "Search direction must be a descent direction" error :( I believe this is due to numerical instabilities when the gradient vanishes. This problem definitely needs to be investigated (but ideally not as part of this PR).

> 2. (The important one). In all my results I've found the ndarray and nalgebra backends to be slower than the Vec one. I've run the benchmarks on an AMD FX-8350 and an Intel i7-12800HX, and the performance difference persists in both cases.
>    This is easily seen in the ParticleSwarm benchmark that is run on the three backends, it can also be seen in the LBFGS one (although this one lacks the nalgebra backend).

I've had a look at these benchmarks. I was able to identify a couple of issues and I have a few ideas of what may be going on.

Regarding ParticleSwarm: This one doesn't really use any linear algebra, and as such I would expect all backends to perform similarly. The only really computationally challenging part is probably sorting the population (particularly given that there are only two parameters to optimize). The fact that the populations are randomly initialized doesn't make benchmarking easier. It should be possible to provide the same initial population for all runs via configure, but this definitely isn't the main issue. I don't really have an idea what the reason may be, other than that maybe the compiler is able to optimize the code better for Vec.

Regarding LBFGS: Here I've also identified a couple of problems. Firstly, in the case of ndarray, the parameter vectors are converted to Vecs via to_vec() (which copies) before being fed to the cost function. I would expect the compiler to optimize this away, but I'm not sure. I've replaced this with p.as_slice().unwrap() (sketched further below). I did not see any impact because of the second problem: I was surprised by how quickly the benchmarks ran and suspected early termination, which was indeed the case. This can be disabled by configuring the solver this way:

let solver = LBFGS::new(linesearch, m)
    .with_tolerance_grad(0.0)?
    .with_tolerance_cost(0.0)?;

However, this means the solver will continue even if the gradient vanishes, leading to the "Search direction must be ...." error. I was able to run it without errors when reducing the number of iterations to 10. I'm not sure how to solve this properly. I suspect that a different test function might be better. Rosenbrock is known for its long and flat valley where most solvers get stuck/progress slowly. Moving the initial parameter further away from the optimum (in order to allow for more iterations) does not help either in my experience, because most solvers are very quick in finding the valley.
However, the results are then quite different. Vec still performs best, but one could imagine that Vec times increase faster with the number of parameters than ndarray times do. This is also what I would expect: for many parameters, Vec should become increasingly worse than ndarray. For a low number of parameters I assume that the compiler is able to optimize a lot.
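For reference, the `as_slice()` change looks roughly like this (a sketch with a hand-rolled Rosenbrock cost; the exact trait bounds depend on the argmin version):

```rust
use argmin::core::{CostFunction, Error};
use ndarray::Array1;

struct RosenbrockNd;

impl CostFunction for RosenbrockNd {
    type Param = Array1<f64>;
    type Output = f64;

    fn cost(&self, p: &Self::Param) -> Result<Self::Output, Error> {
        // Borrow the underlying data instead of copying it with `to_vec()`.
        let x = p.as_slice().unwrap();
        Ok(x.windows(2)
            .map(|w| 100.0 * (w[1] - w[0].powi(2)).powi(2) + (1.0 - w[0]).powi(2))
            .sum())
    }
}
```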

Here are my results:

[LBFGS benchmark results plots]

(Note that my machine is quite old and I had other programs running)

> To run all benchmarks do:
> [...]
> The generated report: target/criterion/report/index.html

Thanks, this was super helpful :)

> I think the benchmarks are already useful as they are right now, however it might be worthwhile to define a common Problem for all (for example the Rosenbrock problem is very similarly defined in at least 4 benchmarks) that doesn't produce convergence problems to avoid code duplication.

In general I agree with this; however, I'm afraid this may be difficult given the different properties of different solvers. I'd still strive for this as much as possible. What I would also find useful are real-world problems (ideally higher-dimensional problems), as long as this doesn't blow the benchmark times out of proportion.

> In the case of an error I've forced the benchmark to panic in order to avoid a failing solver to report an incredibly small time for its execution.

👍

> Another thing that might be worthwhile benchmarking is the impact of the loggers in the optimization performance. I might add something in that respect after going through your feedback on this.

This would be very interesting indeed, in particular since the observers interface has caused performance degradation in the past (#112).

> I've also modified the bench profile to include debug symbols, this isn't strictly necessary but it generates the necessary info for generating flamegraphs with these benchmarks. (change in top level Cargo.toml)

👍

> Let me know what you think on the benchmark approach and the issues, any feedback is welcome :)

I love it! This is an excellent basis for the upcoming benchmarking journey :)

@jonboh (Contributor, Author) commented Nov 20, 2023

Hi @stefan-k, happy to see you back :)

On the point about early termination and setting the cost and gradient tolerances to 0: if the algorithm performs at least a small number of iterations, I don't think the benchmark really needs to set those to 0. Once we have a baseline of the algorithm's performance when solving the problem to a given threshold, any change from that baseline would be significant (as long as the threshold is not changed).
Performance-wise I think it would be acceptable for a solver to not reach the solution of the problem and get stuck in a local minimum, as long as the benchmark is used just to evaluate computational performance (the number crunching), since the trajectories of the algorithms should always be the same. Related to this, I've noticed that I did not set any seed; for algorithms with any randomness I'll add one.
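One way to make the random parts reproducible independently of the solver internals is to derive all random initialization from a fixed-seed RNG; a minimal sketch (the bounds and sizes are placeholders):

```rust
use rand::{rngs::StdRng, Rng, SeedableRng};

/// Reproducible initial population for a benchmark run.
fn initial_population(seed: u64, particles: usize, dim: usize) -> Vec<Vec<f64>> {
    let mut rng = StdRng::seed_from_u64(seed);
    (0..particles)
        .map(|_| (0..dim).map(|_| rng.gen_range(-5.0..5.0)).collect())
        .collect()
}
```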

Regarding the part about standard problems, I agree that it would be good to have higher-dimensional problems to test (ideally based on the real world). Your response gave me the final push to publish a crate that I had abandoned some time ago, a rewrite of the GKLS generator. Something like this would allow us to parametrize the benchmarks by the number of dimensions of the problem or by its complexity; they are as synthetic as they come though 😅. I've used this generator in the past to compare algorithms based on the number of iterations on the cost function.
Another source that comes to mind for problems is the benchmark test suite of infinity77. Most of them, however, are function generators like GKLS that would need to be rewritten in Rust, and they would still be synthetic problems. I'm not familiar with any repository of open-source real-world problems.
If you think it's OK to add a dev-dependency on the function generator, I can write the benchmarks for it. There's some work on the GKLS functions that I could use to guide the function selection.

For the purpose of this PR I think it is OK not to address the issues with the backends, so we can keep it focused on benchmarking and preventing performance regressions: generate the baselines to characterize the algorithms now, and address the performance peculiarities later once we have them characterized.

> This would be very interesting indeed, in particular since the observers interface has caused performance degradation in the past (#112).

I'll add them 👍
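Something along these lines, probably (a sketch; `run_lbfgs_plain` and `run_lbfgs_observed` are placeholders for helpers that build the same executor without and with an attached observer):

```rust
use criterion::Criterion;

fn bench_observer_overhead(c: &mut Criterion) {
    let mut group = c.benchmark_group("lbfgs_observers");
    // Identical solver setup; the only difference is the attached observer.
    group.bench_function("without_observer", |b| b.iter(|| run_lbfgs_plain().unwrap()));
    group.bench_function("with_observer", |b| b.iter(|| run_lbfgs_observed().unwrap()));
    group.finish();
}
```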

@stefan-k (Member) commented

> Hi @stefan-k, happy to see you back :)

Thanks :) I was unfortunately only sort-of back, but now I should be able to be more responsive :)

> On the point about early termination and setting the cost and grad to 0, if the algorithm is performing at least a small number of iterations I don't think it is really necessary for the benchmark to set those to 0, as once we have a baseline of the performance of the algorithm in solving the problem to a given threshold, any change from that baseline would be significant (as long as the threshold is not changed). Performance wise I think it would be acceptable to have a solver not reach the solution of the problem and get stuck in a local minimum as long as the benchmark is used just to evaluate computational performance (the number crunching), as the trajectories of the algorithms should always be the same.

I agree in principle. At least for some solvers I'm a bit afraid that having an insufficient number of iterations may lead to certain code paths not being part of the benchmark. Also, for solvers with line searches, the time spent in the line search may depend on the iteration number. However, I agree that having a baseline is the important part and I'm sure that these concerns aren't something we should bother too much about, just something we should keep in mind in case something isn't as expected.

> Related to this I've seen that I did not set any seed, in the case of algorithms with any randomness I'll add it.

👍

> Regarding the part about the standard problems I agree that it would be good to have higher dimensional problems to test (ideally based in the real world), your response gave me the final push to publish a crate that I had abandoned some time ago, a rewrite of the GKLS generator. Something like this would allow us to parametrize the benchmarks with the amount of dimensions of the problem or its complexity, they are as synthetic as they come though 😅. I've used this generator in the past to compare algorithms based on the amount of iterations on the cost function. Another source that comes to mind for sourcing problems is the benchmark test suite of infinity77. Most of them however are function generators as GKLS that would need to be rewritten into Rust, and still would be synthetic problems. I'm not familiar with any repository of open source real problems. If you think it's ok to add a dev-dependency for the function generator I can generate the benchmarks for it. There's some work on the GKLS functions that I could use to guide the function selection.

This is amazing! To be frank, my main motivation for having real-world problems is not so much for benchmarking as for having them as an educational resource for people starting out with argmin (i.e. examples). For benchmarks I think synthetic problems are great, so feel free to add your library as a dev dependency!

> For the purpose of this PR I think it is ok to not address the issues with the backends, so we can keep it focused on the benchmarking and preventing performance regressions, by generating the baselines to characterize the algorithms and address the performance peculiarities later on when we have them characterized.

Good point, I absolutely agree!

Thanks again for the work and patience! :) I'll strive to be more responsive from now on :)
