Should cargo-criterion support baselines? #10

Open
bheisler opened this issue Jul 6, 2020 · 2 comments
Comments

@bheisler
Owner

bheisler commented Jul 6, 2020

I had this idea kicking around that baselines could be replaced with alternate "timelines" (since cargo-criterion will hopefully soon support a full-history chart).

I was never very satisfied with the workflow of Criterion.rs' baselines (and others seem to agree on that - eg. critcmp largely exists to make up for deficiencies in the workflow supported by baselines). Thing is, I have no idea what sort of workflow would work better.

This will require some design work. Probably won't be available in 1.0.0.

@MOZGIII

MOZGIII commented Jan 25, 2021

We're currently improving our benchmarking suite at https://github.com/timberio/vector, and we figured we'd like a way to compare the benchmarking results in the long run.
It may not be exactly what baselines were designed for, but I just wanted to provide some feedback that it's something we're looking for.

This is tricky, and we're looking into correct ways to implement it. We're currently thinking about storing the bench data output from the PR in question + master together (i.e. both branches being benched on the same system at the same time) for long-term comparison, but we haven't yet figured out how we want to use this data.


I also find myself needing to compare seemingly arbitrary benchmarks against each other. For instance, I have a suite where the same benching code is run in different "environments" (e.g. tracing on and tracing off), and sometimes we iterate on those layers. I need to compare two benches against each other, and also two versions of the code against each other.

I.e. the flow looks like this:

$ git checkout master
$ cargo bench
# produces:
# - env1/bench1
# - env1/bench2
# - env2/bench1
# - env2/bench2
# (to be used as base)
$ git checkout mychange
$ cargo bench
# produces:
# - env1/bench1
# - env1/bench2
# - env2/bench1
# - env2/bench2
# (to be used as new)
$ critcmp (or similar)

It would make sense to compare bench1 across base/env1, base/env2, new/env1 and new/env2, rather than what critcmp currently does, which is to compare env1/bench1 across base and new.

Does it make sense?
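To make the requested comparison concrete, here is a minimal Python sketch of the regrouping. It is not how critcmp or cargo-criterion work; the `regroup` helper, the `runs` layout, and the timings are all hypothetical, and it assumes every benchmark name has the form `env/bench`.

```python
from collections import defaultdict


def regroup(runs):
    """Group results by bare benchmark name instead of by full "env/bench" name.

    `runs` maps a run label (e.g. "base", "new") to a dict of
    "env/bench" -> mean time in nanoseconds (hypothetical layout).
    Returns a dict of bench -> {"run/env": mean time}.
    """
    grouped = defaultdict(dict)
    for run_label, results in runs.items():
        for full_name, mean_ns in results.items():
            # Split off the environment prefix; the rest is the benchmark name.
            env, bench = full_name.split("/", 1)
            grouped[bench][f"{run_label}/{env}"] = mean_ns
    return dict(grouped)


# Hypothetical timings, for illustration only.
runs = {
    "base": {"env1/bench1": 100.0, "env2/bench1": 140.0},
    "new": {"env1/bench1": 90.0, "env2/bench1": 150.0},
}
table = regroup(runs)
# table["bench1"] now holds four columns:
# base/env1, base/env2, new/env1 and new/env2.
```

Each row then covers one benchmark across every (run, environment) pair, which is the view described above.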


I hope this feedback will be helpful. If you'd like to chat - we're at http://chat.vector.dev/

cc @jszwedko

@BurntSushi

BurntSushi commented Mar 16, 2021

> I was never very satisfied with the workflow of Criterion.rs' baselines (and others seem to agree on that - eg. critcmp largely exists to make up for deficiencies in the workflow supported by baselines). Thing is, I have no idea what sort of workflow would work better.

So I thought I might just explain my workflow here. Seeing this issue now made me remember an email I got from you that I never responded to. :-( Sorry about that. It slipped down my inbox and I ended up forgetting about it.

I'll do my best to explain my workflow. I've been using this kind of flow for a long time.

So basically, I start off by running all the benchmarks and saving their output. I usually call this master or baseline or something. It's what I compare all future runs with. Then I'll go and make some changes, run the benchmarks again, and maybe call them foo, where foo is some short descriptor related to that change, e.g., simdavx2 or something. I'll then run critcmp baseline simdavx2 to look at comparisons between them. Then I might home in on a particular benchmark or set of benchmarks. Then I start running, e.g., cargo bench memchr/crate/shortinput -- --save-baseline simdavx2-shortinput and try to tune it. I then use things like critcmp baseline simdavx2 simdavx2-shortinput to see the progression of the benchmark over the different attempts. As you might imagine, things can get pretty fluid here. I might want to compare lots of different runs.

But there are other workflows too. Only being able to compare benchmarks with the same name across distinct runs is incredibly limiting. I also want to be able to compare benchmarks within runs. For example, I might have memchr/crate/shortinput and memchr/libc/shortinput, where the former is my implementation and the latter is something else that I'm trying to match or beat or whatever. But they have different names. With critcmp, I can just do critcmp baseline -g 'memchr/.*?/(.*)' and it will do the correct grouping for me.
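As a rough illustration of the grouping idea (not critcmp's actual implementation), the following Python sketch keys each benchmark name by the text captured by the regex, so names that differ only in the uncaptured part land in the same group. The `group_by_captures` helper and the third benchmark name are hypothetical, and Python's `re` module stands in for the regex engine critcmp uses.

```python
import re
from collections import defaultdict


def group_by_captures(names, pattern):
    """Group benchmark names by the text matched by the pattern's capture groups.

    Names that do not match the pattern are left out.
    """
    regex = re.compile(pattern)
    groups = defaultdict(list)
    for name in names:
        match = regex.search(name)
        if match is None:
            continue
        # Join all capture groups into one key; unmatched optional groups are skipped.
        key = "/".join(g for g in match.groups() if g is not None)
        groups[key].append(name)
    return dict(groups)


names = [
    "memchr/crate/shortinput",
    "memchr/libc/shortinput",
    "memchr/crate/longinput",  # hypothetical extra benchmark
]
groups = group_by_captures(names, r"memchr/.*?/(.*)")
# The two "shortinput" benchmarks share a group even though their full names differ.
```

With this pattern, the implementation component (`crate` or `libc`) is ignored and the captured input name becomes the comparison key.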

And there's also the presentation aspect:

  1. I want it to work easily on the CLI with minimal friction.
  2. When doing comparisons, I want the output to be succinct. One line per benchmark.
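One way to picture the "one line per benchmark" requirement is the Python sketch below. It is only an illustration of the layout, not the output format of critcmp or cargo-criterion; the `format_comparison` helper, the column format, and the timings are hypothetical.

```python
def format_comparison(results, baselines):
    """Render one line per benchmark, with one cell per baseline.

    `results` maps a benchmark name to {baseline name: mean time in ns}.
    Each cell shows the time and its ratio to the fastest baseline on that line.
    """
    width = max(len(name) for name in results)
    lines = []
    for bench in sorted(results):
        times = results[bench]
        present = [times[b] for b in baselines if b in times]
        fastest = min(present) if present else None
        cells = []
        for b in baselines:
            if b in times:
                cells.append(f"{b}={times[b]:.1f}ns ({times[b] / fastest:.2f}x)")
            else:
                # This baseline has no measurement for the benchmark.
                cells.append(f"{b}=n/a")
        lines.append(f"{bench:<{width}}  " + "  ".join(cells))
    return lines


# Hypothetical timings, for illustration only.
results = {"memchr/crate/shortinput": {"baseline": 20.0, "simdavx2": 10.0}}
lines = format_comparison(results, ["baseline", "simdavx2"])
for line in lines:
    print(line)
```

Any number of baselines can be passed, so the same layout covers a progression such as baseline, simdavx2, simdavx2-shortinput.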

Happy to answer any questions about my workflow. It's a little hard to describe, so I'd be happy to elaborate on any unclear points or why I didn't use X feature in Criterion. (It is plausible that I didn't know about X, whatever it is. :-))

3 participants