Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

geomean(::AbstractInterval) is not correct #173

Closed
hyrodium opened this issue Jan 29, 2024 · 5 comments · Fixed by #174
Closed

geomean(::AbstractInterval) is not correct #173

hyrodium opened this issue Jan 29, 2024 · 5 comments · Fixed by #174

Comments

@hyrodium
Copy link
Collaborator

julia> using IntervalSets, StatsBase

julia> a, b = 5, 10
(5, 10)

julia> geomean(a..b)  # Should be 7.3575...
7.0710678118654755

julia> geomean(range(a,b,length=10000))
7.357559595510975

julia> exp((b*log(b)-a*log(a)-b+a)/(b-a))
7.357588823428852

geomean(::AbstractInterval) was introduced in #171, but the definition sqrt(leftendpoint(d) * rightendpoint(d)) seems not correct.

$$ \mathop{\text{GM}}[a..b] = \exp\left(\frac{1}{b-a} \int_a^b \log(x) dx \right) = \exp\left(\frac{b\log(b)-a\log(a)-b+a}{b-a}\right) $$

See Geometric mean (wikipedia) for more information.

This correct definition has the following consistency with mean(range(...))

julia> using IntervalSets, StatsBase, Plots

julia> plot([geomean(range(a,b,length=n)) for n in 2:1000], label="geomean of evenly spaced points")

julia> hline!([exp((b*log(b)-a*log(a)-b+a)/(b-a))], label="geomean of an interval set")

geomean

@aplavin
Where did you find the original definition in https://github.com/JuliaMath/IntervalSets.jl/pull/171/files#diff-29865909ee404f36758d3fb0d2d549a636b7c5cd4967d9855c245f59b3a0bb67R6?

@aplavin
Copy link
Contributor

aplavin commented Jan 31, 2024

My motivation was that:

  • mean(interval) is its midpoint on the linear scale, geomean() is its midpoint on the logscale
  • mean(interval) = mean(endpoints(interval)), same makes sense for geomean()

I wonder what usecases you have, where the current geomean definition doesn't give the expected result. Don't really understand the connection between mean/geomean of an interval and mean/geomean of a function (that you linked). I don't think there is unambiguous meaning of "(geo)mean of an interval" in maths in general, so here we just choose the most pragmatic one.
Just looked up my usage of geomean(interval) over the past years (I defined it myself quite a few times, and have to continue doing so even now due to #168). All uses are for plotting: find the midpoint of the interval when plotted on the logscale.

@hyrodium
Copy link
Collaborator Author

I see. Thank you for the explanation.

I don't think there is unambiguous meaning of "(geo)mean of an interval" in maths in general, so here we just choose the most pragmatic one.

I think mean(a..b) should be defined by $E[X], (X \sim \text{Uniform}(a,b))$. This definition can be regarded as a centroid of an interval, and it can be generalized to a non-connected subset of $\mathbb{R}$.

Don't really understand the connection between mean/geomean of an interval and mean/geomean of a function (that you linked).

How about this comment (https://stats.stackexchange.com/a/386476)? Similarly to mean(a..b), I think geomean(a..b) should be defined by $\exp(E[\log(X)]), (X \sim \text{Uniform}(a,b))$. The geomean function is owned by StatsBase.jl, so I believe we need statistics theory to add methods to this function.

I wonder what usecases you have, where the current geomean definition doesn't give the expected result.

I actually don't have use cases for geomean for now. I just thought the inconsistent definition discussed above should not be provided from this package.

All uses are for plotting: find the midpoint of the interval when plotted on the logscale.

Then, how about adding a new function mean_logscale(::AbstractInterval) instead of adding a geomean(::AbstractInterval) method?

@aplavin
Copy link
Contributor

aplavin commented Jan 31, 2024

How about this comment (https://stats.stackexchange.com/a/386476)?

That again relates to mean/geomean of random variables, not of intervals. Basically, that's how [geo]mean(Uniform(interval)) or [geo]mean(Uniform(union_of_intervals)) should work.
Actually, I proposed to add Uniform(interval) constructors, but somehow Distributions.jl devs don't want them (JuliaStats/Distributions.jl#1809).

Similarly to mean(a..b), I think geomean(a..b) should be defined by <...>.

That's definitely a non-starter. I'm sure many potential users (a large fraction, not a large absolute number :) ) would be surprised if geomean(1..100) works (doesn't error) and returns something other than 10.

With geomean and collections, it has a nice property that exp(mean(X)) == geomean(exp.(X)). The current definition of geomean(interval) also follows this property (conceptually, because literal exp.(X) isn't supported for intervals). Just remember that "interval" is not a "probability distribution".

I just thought the inconsistent definition discussed above should not be provided from this package.

I wouldn't call it inconsistent, but agree that it's probably best not to define it here. As this thread shows, there are at least two different meanings geomean(interval) can have, and both are consistent with other functions in different ways.
So, just reverting 6074974 should be fine, and is the obvious solution.

@timholy
Copy link
Member

timholy commented Jan 31, 2024

I agree that the notion proposed by @hyrodium is the more mathematically correct one. Of course, for both mean and geomean, there's an implicit sense of measure, i.e., that we're computing, e.g.,

$$ \frac{\int x dx}{\int dx} $$

rather than

$$ \frac{\int x w(x) dx}{\int w(x) dx} $$

Strictly speaking that's not encoded by just knowing the endpoints of the interval, and it's not invariant to changes of coordinates, i.e., mean(f.(a..b)) != f(mean(a..b)).

@aplavin
Copy link
Contributor

aplavin commented Jan 31, 2024

From the discussion in this thread, it seems clear that there are at least two different conflicting meanings one could assign to geomean(interval). And there is no definite meaning assigned to "geomean of an interval" in maths in general.
These are totally reasonable arguments to revert 6074974, and recommend geomean(endpoints(interval)) or geomean(Uniform(interval)) instead, depending on which meaning is relevant in each case. (the latter is now geomean(Uniform(endpoints(interval)...)), but I'm hopeful that at some point Distributions will support intervals).

mean(interval) technically shares the same concerns, but both definitions happen to give the same answer there – so there are no issues with that method staying.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants