account for variance in samples #10

Open
bengland2 opened this issue Feb 12, 2020 · 17 comments

@bengland2

The current implementation of touchstone calculates averages and then compares them. This approach does not take into account the variation in the baseline samples or the variation in the new SUT's samples, so you cannot tell whether a change in the average is statistically significant. There are established statistical methods for incorporating variance into the comparison, as described here:

https://mojo.redhat.com/docs/DOC-1089994

which basically describes how to use the scipy.stats.ttest_ind() function. It would also be good to monitor the % deviation of the baseline and new-run samples as part of determining whether a regression has occurred. This kind of analysis can prevent false positives and false negatives, and avoid wasting time on unnecessary investigations.
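
For illustration, here is a minimal sketch of the kind of ttest_ind()-based pass/fail check that article describes (the sample values and the 0.05 significance threshold are made up, not touchstone defaults):

```python
# Minimal sketch of a variance-aware pass/fail comparison (illustrative only).
from scipy import stats

baseline = [105.2, 98.7, 101.4]   # e.g. throughput samples from the baseline run
new_sut  = [92.1, 94.8, 90.5]     # samples from the new system under test

t_stat, p_value = stats.ttest_ind(baseline, new_sut)
if p_value < 0.05:
    print(f"significant difference (p={p_value:.3f}) -- flag a possible regression")
else:
    print(f"difference not significant (p={p_value:.3f}) -- likely just sample noise")
```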

@inevity

inevity commented Sep 8, 2021

I cannot access the link above. Can you post it here?

@bengland2

@inevity sorry, that link is not available anymore; Mojo is gone, and it wasn't accessible outside Red Hat anyway. Here is the article:

simple performance regression pass_fail script _ Mojo.pdf

@inevity

inevity commented Sep 18, 2021

The t-test's assumptions are:

  1. Normally distributed data
  2. IID samples
  3. Homogeneity of variance
    see https://stanford-cs329s.github.io/slides/cs329s_13_slides_monitoring.pdf

So the sample data should be produced by a stable workload, shouldn't it?
As for the average-comparison implementation in the current master, what are its assumptions?

@bengland2

@inevity ,

  • normally distributed data - I haven't experimentally demonstrated this, though I consider it likely
  • IID samples - that's part of why I introduced cache dropping to benchmark-operator: I wanted the samples to be IID for storage benchmarks.
  • homogeneity of variance - I'm not familiar with this term, and your link doesn't help.

So if you don't use a T-test, what's an alternative method for comparing 2 sets of samples to see if they are truly different from a statistical perspective? Just comparing averages is useless.

Here's a better online link about the T-test (my original reference was Raj Jain's classic text "The Art of Computer Systems Performance Analysis", which is about 30 years old, but statistics hasn't changed that much in this area AFAICT).

@inevity

inevity commented Sep 23, 2021

Z-tests are closely related to t-tests, but t-tests are best performed when an experiment has a small sample size, less than 30.
https://www.investopedia.com/terms/z/z-test.asp
Do we need to consider this case?

@bengland2

@inevity I don't think a z-test sounds useful. Why? It is usually expensive in time and resources to generate a single sample, and we have many data points to cover, so in my experience we typically limit ourselves to 3 samples per data point. The standard deviation is barely meaningful with such a small set of samples, but it's better than nothing (i.e. better than just comparing averages). The T-test at least takes the variance in the samples into account and gives you some idea of whether you can be confident in saying that the two sets of tests have a significantly different result. The script I linked to in the initial post makes it easy to try it out and see for yourself how well it works. See if you agree with its conclusions.

@bengland2

@inevity sorry, I don't understand your last reply. Which benchmark from Google? And if you are saying I'm assuming a normally distributed set of results, I think I'm guilty of that. Perhaps I'll have to put this to the test. But I still think it's better than comparing averages of samples without regard for standard deviation. Don't let the best be the enemy of the better.

@inevity

inevity commented Oct 7, 2021

google/benchmark#593: this PR uses the U-test to compare two samples.
The debate there was: 'More specifically still, a t-test is only useful when we have normally distributed results to compare. Do you have any reason to assume a priori that the distributions of benchmark repetition results are normally distributed? It's true that means of samples from a distribution (which is what we're talking about) tend towards normal distribution (thanks, central limit theorem!), but how quickly, and how large each sample needs to be, and how many samples you need, depends on how skewed the original data is, iirc'.
So they use the U-test for the comparison. The U-test does not compare averages, and it makes no assumptions about the shape of the distribution or its variance.

@bengland2

@inevity Now I understand what you are talking about. I've never heard of a U-test; that's something new for me. Here is the scipy package you are referring to. This is an interesting proposal; I'll have to read about it and think about it a little more. I'm not that attached to using a T-test: I was just shopping for a statistical test that accounts for variance when comparing 2 sample sets, and Python's scipy implemented a T-test. If the U-test does the same thing without assumptions about the distribution of test results, then it sounds useful to me.
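
For reference, the drop-in swap being discussed would look roughly like this (a sketch with made-up values, not touchstone code):

```python
# Same comparison as before, but using the Mann-Whitney U test,
# which does not assume normally distributed samples.
from scipy import stats

baseline = [105.2, 98.7, 101.4]
new_sut  = [92.1, 94.8, 90.5]

u_stat, p_value = stats.mannwhitneyu(baseline, new_sut, alternative='two-sided')
print(f"U={u_stat}, p={p_value:.3f}")
```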

@mfleader

mfleader commented Oct 25, 2021

  1. Hypothesis testing for statistical significance is one of the main sources of the statistical crisis in science.
  2. You can use a general linear model to replace the t-test comparison of means (see the sketch after this list).
  3. You can use a generalized linear model to change the normality assumption.
  4. The Mann-Whitney U test is a special case of a proportional odds model (which is to say it is still a generalized linear model).
  5. If you don't use hypothesis testing and statistical significance, you have to come up with a decision function that you're optimizing, with a model parameter that you've estimated from your data sample.
  6. A lot of data related to computers is multimodal, and most out-of-the-box statistical models that we have access to assume unimodal data (though I think we can still glean some insight if we're careful).
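
As a rough illustration of point 2 (a sketch assuming statsmodels and pandas, with made-up latency values), a linear model with a group indicator recovers the same comparison of means as a two-sample t-test:

```python
# Fit latency ~ group; the coefficient on the group term estimates the
# difference in means, and its p-value plays the role of the t-test p-value.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "latency": [10.2, 9.8, 10.5, 12.1, 11.9, 12.4],
    "group":   ["baseline"] * 3 + ["new_sut"] * 3,
})

fit = smf.ols("latency ~ C(group)", data=df).fit()
print(fit.params)    # intercept = baseline mean, C(group) term = mean difference
print(fit.pvalues)
```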

@mfleader

I prefer Bayesian methods for estimating generalized linear models of computer performance data, but frequentist non-parametric and semi-parametric models, like the U test, seem to have their use cases for recovering parameters of interest.

@bengland2

@mfleader I'm not sure what you mean by "generalized linear model", sorry. What's the simplest yet most reliable way of doing this? I thought the U-test wasn't assuming anything about the distribution, unlike the t-test, which assumes a normal distribution? So what decision function would you use? That seems extremely difficult to come up with, since you have to estimate it from your data sample, while we are trying to write code that is known to work without regard for the data sample.

@baul, I tried your mannwhitneyu() and I just replaced ttest_ind() with it, which means I already have a way to experiment with it and compare the two. They don't give the same answers, interestingly. When I get more time I'll try to run both against some real experimental data and see what happens.
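
For illustration, here is a small sketch that runs both tests on the same made-up samples, which shows how the two p-values can differ:

```python
# Run the t-test and the U-test side by side on identical inputs.
from scipy import stats

baseline = [105.2, 98.7, 101.4, 99.9, 103.3]
new_sut  = [97.5, 96.8, 100.1, 95.9, 98.2]

print("t-test p-value:", stats.ttest_ind(baseline, new_sut).pvalue)
print("u-test p-value:", stats.mannwhitneyu(baseline, new_sut, alternative='two-sided').pvalue)
```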

@mfleader

mfleader commented Oct 26, 2021

The decision function would be a function of the parameter we're testing, and it would compute something either we care about or the business cares about, like if there were some function that computed how much money it would cost the user for each potential microsecond increase in latency. I don't know enough about estimating costs in performance for a cloud platform to actually write that function, so I have ignored it by using the identity function or a negative identity function as a decision, or cost function. For example, I would use the identity function for estimating differences in latency because larger values are worse which translates to the group with the higher latency costing more because the cost function outputs a higher value for it. In general, you already do some of this when you're thinking about the cost to performance given a software change.
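
As a purely hypothetical sketch (not tied to any real cost data), such a decision function could be as simple as the identity:

```python
def latency_cost(latency_increase_us: float) -> float:
    """Identity cost: each extra microsecond of latency costs one unit."""
    return latency_increase_us

# e.g. if the estimated difference between the new SUT and the baseline is +250 us:
print(latency_cost(250.0))
```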

@mfleader

mfleader commented Oct 26, 2021

With regards to the Mann-Whitney U Test, I was pointing out that it is arguably a special case of a proportional odds model, to say that we cannot entirely avoid assumptions about our data. We just need to understand and clearly communicate the consequences of the models that we choose to use. Given the opportunity, I believe we would want to use the more powerful statistical model, as in a general linear model, instead of a t-test or a Mann-Whitney U test.
