account for variance in samples #10
Comments
Cannot access the link above. Can you post it here?
@inevity sorry, that link is not available anymore; Mojo is gone and wasn't accessible outside Red Hat anyway. Here is the article:
The t-test's assumption is that the samples are drawn from normally distributed populations.
So the sample data should be generated by a stable workload, shouldn't it?
@inevity, so if you don't use a t-test, what's an alternative method for comparing two sets of samples to see if they are truly different from a statistical perspective? Just comparing averages is useless. Here's a better online link about the t-test (my original reference was Raj Jain's classic text "The Art of Computer Systems Performance Analysis", which is about 30 years old, but statistics hasn't changed that much in this area AFAICT).
Z-tests are closely related to t-tests, but t-tests are best performed when an experiment has a small sample size, less than 30.
@inevity I don't think a z-test sounds useful. Why? It is usually expensive in time and resources to generate a single sample, and we have many data points to cover, so in my experience we typically limit them to 3 samples for each data point. The standard deviation is barely meaningful with such a small set of samples, but it's better than nothing (i.e., better than just comparing averages). The t-test at least takes the variance in samples into account and gives you some idea of whether you can be confident in saying that the two sets of tests have a significant difference in result. The script I linked to in the initial post makes it easy to try it out and see for yourself how well it works. See if you agree with its conclusions.
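(For concreteness, here is a minimal sketch of that kind of comparison using scipy.stats.ttest_ind with 3 samples per data point; the sample values are invented purely for illustration.)

```python
# Minimal sketch: two sets of 3 samples each, compared with a t-test.
# The numbers below are invented purely for illustration.
from scipy import stats

baseline = [101.2, 98.7, 103.4]    # 3 samples from the baseline run
candidate = [95.1, 96.8, 94.3]     # 3 samples from the new SUT

t_stat, p_value = stats.ttest_ind(baseline, candidate)

# A small p-value (commonly < 0.05) means the gap between the means is
# unlikely to be explained by sample-to-sample variance alone.
verdict = "significant" if p_value < 0.05 else "not significant"
print(f"t = {t_stat:.2f}, p = {p_value:.3f} -> difference is {verdict}")
```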
And the u-test makes no assumptions about the underlying distribution or its variance. So maybe it is more appropriate?
@inevity sorry, I don't understand your last reply. Which benchmark from Google? And if you are saying I'm assuming a normally-distributed set of results, I think I'm guilty of that. Perhaps I'll have to put this to the test. But still, I think it's better than comparing averages of samples without regard for standard deviation. Don't let the best be the enemy of the better.
google/benchmark#593: this PR uses the u-test to compare two samples.
@inevity Now I understand what you are talking about. I've never heard of a u-test; that's something new for me. Here is the scipy package that you are referring to. This is an interesting proposal; I'll have to read about it and think about it a little more, but I'm not that attached to using a t-test. I was just shopping for a statistical test that accounts for variance when comparing two sample sets, and Python's scipy implements a t-test, but if the u-test does the same thing without assumptions about the distribution of test results, then it sounds useful to me.
I prefer Bayesian methods for estimating generalized linear models of computer performance data, but frequentist non-parametric and semi-parametric models, like the U test, seem to have their use cases for recovering parameters of interest. |
@mfleader I'm not sure what you mean by "generalized linear model", sorry. What's the simplest yet most reliable way of doing this? I thought the u-test didn't assume anything about the distribution, unlike the t-test, which assumes a normal distribution? So what decision function would you use? This seems extremely difficult to come up with, since you have to estimate it from your data sample, when we are trying to write code that is known to work without regard for the data sample. @baul, I tried your mannwhitneyu() and just replaced ttest_ind() with it, which means I already have a way to experiment with it and compare the two. They don't give the same answers, interestingly. When I get more time I'll try to run both against some real experimental data and see what happens.
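(A rough sketch of that side-by-side experiment, assuming scipy.stats and made-up sample values, might look like this.)

```python
# Rough sketch: run the t-test and the Mann-Whitney U test on the same
# (made-up) samples and compare their p-values.
from scipy import stats

baseline = [10.1, 10.4, 9.9, 10.6, 10.2]
candidate = [10.8, 11.0, 10.7, 11.3, 10.9]

t_stat, t_p = stats.ttest_ind(baseline, candidate)
u_stat, u_p = stats.mannwhitneyu(baseline, candidate, alternative="two-sided")

# The t-test works on means and assumes roughly normal data; the U test
# only compares ranks, so the two p-values generally do not match.
print(f"t-test:          p = {t_p:.4f}")
print(f"Mann-Whitney U:  p = {u_p:.4f}")
```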
The decision function would be a function of the parameter we're testing, and it would compute something either we care about or the business cares about, like if there were some function that computed how much money it would cost the user for each potential microsecond increase in latency. I don't know enough about estimating costs in performance for a cloud platform to actually write that function, so I have ignored it by using the identity function or a negative identity function as a decision, or cost, function. For example, I would use the identity function for estimating differences in latency because larger values are worse, which translates to the group with the higher latency costing more because the cost function outputs a higher value for it. In general, you already do some of this when you're thinking about the cost to performance given a software change.
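(A toy illustration of that idea; the function names and numbers below are hypothetical, not part of touchstone or any existing tool.)

```python
# Toy cost functions over an estimated performance difference.
# Names and values are hypothetical.

def latency_cost(delta_us: float) -> float:
    """Identity cost: each extra microsecond of latency adds one unit of cost."""
    return delta_us

def throughput_cost(delta_ops: float) -> float:
    """Negative identity: more throughput is better, so a gain lowers the cost."""
    return -delta_ops

# If the candidate adds 150 us of latency and 2000 ops/s of throughput:
print(latency_cost(150.0))      # 150.0  -> higher cost, i.e. worse
print(throughput_cost(2000.0))  # -2000.0 -> lower cost, i.e. better
```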
With regard to the Mann-Whitney U test, I was pointing out that it is arguably a special case of a proportional odds model, which is to say that we cannot entirely avoid assumptions about our data. We just need to understand and clearly communicate the consequences of the models that we choose to use. Given the opportunity, I believe we would want to use the more powerful statistical model, such as a generalized linear model, instead of a t-test or a Mann-Whitney U test.
The current implementation of touchstone calculates averages and then compares them. This approach does not take into account the variation in the baseline samples or the variation in the new SUT's samples, so you cannot tell whether the change in average is statistically significant. There are established statistical methods for incorporating variance into the comparison, as described here:
https://mojo.redhat.com/docs/DOC-1089994
which basically describes how to use the scipy.stats.ttest_ind() function. It would also be good to monitor the % deviation of the baseline and new-run samples to help determine whether a regression has occurred or not. This kind of analysis can prevent false positives and negatives and avoid wasting time on unnecessary investigations.
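(A minimal sketch of the kind of comparison described above: average-only vs. a t-test that accounts for variance, plus the % deviation of each sample set. The numbers are placeholders.)

```python
# Minimal sketch (placeholder numbers): average-only comparison vs. a t-test,
# plus the % deviation of the baseline and new-run samples.
from statistics import mean, stdev
from scipy.stats import ttest_ind

baseline = [250.0, 262.0, 244.0]   # baseline samples
new_run = [255.0, 249.0, 260.0]    # new-SUT samples

# Average-only comparison: hides whether the gap is just run-to-run noise.
print(f"delta of averages: {mean(new_run) - mean(baseline):+.1f}")

# % deviation of each set shows how noisy the workload itself is.
for name, samples in (("baseline", baseline), ("new_run", new_run)):
    print(f"{name}: %dev = {100 * stdev(samples) / mean(samples):.1f}%")

# t-test: takes that variance into account when judging the delta.
t_stat, p_value = ttest_ind(baseline, new_run)
print(f"p-value = {p_value:.3f} (small p suggests a statistically significant change)")
```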