
Performance compared to ForwardDiff #121

Open
arnauqb opened this issue Apr 12, 2024 · 7 comments

@arnauqb

arnauqb commented Apr 12, 2024

In this small example:

using ForwardDiff, StochasticAD, BenchmarkTools

function test(x, y, alpha)
    x = x * alpha
    return sum(x .* y)
end

x = rand(10000);
y = rand(10000);
alpha = 2.0;
@btime ForwardDiff.derivative(alpha -> test(x, y, alpha), alpha); # 44.045 μs (7 allocations: 312.67 KiB)
@btime StochasticAD.derivative_estimate(alpha -> test(x, y, alpha), alpha); # 727.524 μs (40028 allocations: 1.83 MiB)

This shows a roughly 16x performance gap between ForwardDiff and StochasticAD. I am currently using StochasticAD for big models and it is causing a bit of a bottleneck. I would expect the two to have similar performance in this case, since there is no discrete stochasticity involved.

Is there a way to reduce the number of allocations?

Any help would be appreciated!

@gaurav-arya
Owner

Hi, thank you for the report :) It's a bit of a hectic time, so I just wanted to let you know that it may be a few weeks before I can deeply examine your case and implement the performance optimizations described below.

Briefly: the slowdown is very likely due to the mutable state used to implement the pruning backend. To see this, try running your example with backend = SmoothedFIsBackend(). For the particular case you've posted, I could very likely optimize it by avoiding the creation of a mutable state when there are no discrete perturbations, as a special case.
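
For reference, a minimal sketch of what switching backends looks like, continuing the snippet from the original post (this assumes the backend keyword of derivative_estimate as described in the StochasticAD docs):

# Same benchmark as in the original post, but with the smoothing backend
# instead of the default pruning backend (no mutable pruning state):
@btime StochasticAD.derivative_estimate(alpha -> test(x, y, alpha), alpha;
    backend = StochasticAD.SmoothedFIsBackend());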

If your real, general problem does have discrete perturbations in all your triples, that special case wouldn't help -- however, I've been meaning to revisit the way these "mutable" states work anyway, which would optimize the general case too :) But in the current design, a slowdown of this magnitude over ForwardDiff is indeed to be expected :/

@arnauqb
Author

arnauqb commented Apr 15, 2024

Thank you for your quick and detailed answer. My current use case looks like this:

  1. Sample x from a vector of Bernoullis
  2. Run x through an expensive but deterministic model.

I guess that even though the expensive part has no randomness of its own, x will still carry a discrete perturbation component that needs to be propagated through step 2, so the special case would not apply...
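
For concreteness, a minimal sketch of that pattern, where expensive_model is a hypothetical stand-in for the deterministic step:

using StochasticAD, Distributions

expensive_model(x) = sum(abs2, x)  # stand-in for the expensive deterministic model
probs = fill(0.3, 1000)            # per-component Bernoulli probabilities

function model(theta)
    x = [rand(Bernoulli(theta * p)) for p in probs]  # step 1: discrete sampling
    return expensive_model(x)                        # step 2: deterministic
end

StochasticAD.derivative_estimate(model, 1.0)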

I'll wait patiently for the update then :)

@gaurav-arya
Owner

gaurav-arya commented Apr 15, 2024

Ah, you could try registering your deterministic model as a single StochasticAD primitive via https://gaurav-arya.github.io/StochasticAD.jl/dev/devdocs.html#via-StochasticAD.propagate and see if that yields any speedup 🙂
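
A rough sketch of what that registration might look like (the dispatch signature here is an assumption -- in particular, whether propagate accepts array arguments directly should be checked against the linked devdocs; expensive_model is a hypothetical name):

import StochasticAD

# Dispatch for triple-valued inputs: treat expensive_model as one primitive,
# so StochasticAD reruns it on perturbed inputs instead of propagating
# triples element-by-element through its internals.
function expensive_model(x::AbstractVector{<:StochasticAD.StochasticTriple})
    return StochasticAD.propagate(expensive_model, x)
end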

@arnauqb
Author

arnauqb commented Apr 15, 2024

Thanks for pointing me to propagate, I did not know about it. It actually did cause a speed-up for the expensive part of the model, but I realized that the bottleneck comes from other simple operations between large vectors of triples, like the one in the original post.

On another topic, and perhaps I should open a new issue for this, how difficult would it be to implement GPU support for stochastic triples?

@gaurav-arya
Owner

A new issue for that would definitely be appropriate! I don't know much about GPUs, but my guess is that it would be important to write rules for vector operations (e.g. using StochasticAD.propagate) rather than relying only on scalar code, as StochasticAD currently does, since scalar code would be slow on a GPU? But perhaps I'm wrong about that... In particular, I wonder whether something like map on a GPU array with a scalar f would be slow or fast. I imagine ForwardDiff's Dual numbers present a similar problem -- I wonder whether they currently play well with GPUs?

@arnauqb
Author

arnauqb commented Apr 16, 2024

OK, I may have a go at this and open an issue once I've made a bit of progress.

@Moelf
Collaborator

Moelf commented Apr 16, 2024

In particular, I wonder whether or not something like map on a GPU array using a scalar f would be slow or fast.

It would be fast if the scalar function f is written with only "simple" operations.
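
For illustration, a minimal sketch of that case (assumes CUDA.jl and a compatible GPU; not specific to StochasticAD):

using CUDA

f(x) = sin(x)^2 + 1f0   # a "simple" scalar function: no allocation,
                        # no dynamic dispatch, so it compiles to a kernel
xs = CUDA.rand(10_000)
ys = map(f, xs)         # executes as a single GPU kernel over the array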
