Question

I am looking for a Java library or implementation that supports calculating the inverse cumulative distribution function of the beta distribution (i.e., quantile estimation) with reasonable precision.

Of course I have tried Apache Commons Math, but in version 3 there still seem to be some precision issues. The problem that led to this question is described in detail below.


Suppose I want to calculate the credible interval of a beta distribution with a lot of trials. In Apache Commons Math ...

import org.apache.commons.math3.distribution.BetaDistribution;

final int trials = 161750;
final int successes = 10007;
final double alpha = 0.05d;

// the supplied precision is the default precision according to the source code
BetaDistribution betaDist = new BetaDistribution(successes + 1, trials - successes + 1, 1e-9);

System.out.println("2.5 percentile :" + betaDist.inverseCumulativeProbability(alpha / 2d));
System.out.println("mean: " + betaDist.getNumericalMean());
System.out.println("median: " + betaDist.inverseCumulativeProbability(0.5));
System.out.println("97.5 percentile :" + betaDist.inverseCumulativeProbability(1 - alpha / 2d));

which delivers

2.5 percentile :0.062030402074808505
mean: 0.06187249616697166
median: 0.062030258659508855
97.5 percentile :0.06305170793994147

The issue is that the 2.5 percentile and the median are essentially identical, while both are greater than the mean.

In comparison, the R package binom delivers

binom.confint(10007+1,161750+2,methods=c("agresti-coull","exact","wilson"))
         method     x      n      mean      lower      upper
1 agresti-coull 10008 161752 0.0618725 0.06070873 0.06305707
2         exact 10008 161752 0.0618725 0.06070317 0.06305756
3        wilson 10008 161752 0.0618725 0.06070877 0.06305703

and the R package stats

qbeta(c(0.025,0.975),10007+1,161750-10007+1)
[1] 0.06070355 0.06305171

To second the results from R, Wolfram Alpha told me the same (its output is not reproduced here).

Final note on the requirements:

  • I need to run a lot of these calculations, so any solution should take no longer than 1 s (which is still a lot compared to the 41 ms of the (albeit wrong) Apache Commons Math).
  • I am aware that one can call R from Java. For reasons I won't elaborate here, that is the last resort if everything else (pure Java) fails.

Update 21.08.12

It seems that the issue has been fixed, or at least improved, in 3.1-SNAPSHOT of Apache Commons Math. For the use case above

2.5 percentile :0.06070354581340706
mean: 0.06187249616697166
median: 0.06187069085946604
97.5 percentile :0.06305170793994147

Update 23.02.13

While at first glance this question and its answers may seem too localized, I think it illustrates very well that some numerical problems cannot be solved (efficiently) with a whatever-first-comes-to-mind hacker approach. So I hope it remains open.


The solution

The issue has been fixed in Apache Commons Math 3.1.1.

The test case above delivered

2.5 percentile :0.06070354581334864
mean: 0.06187249616697166
median: 0.06187069085930821
97.5 percentile :0.0630517079399996

which matches the results from the R package stats. Extensive use of the 3.1-SNAPSHOT and later versions has also not caused any problems.

Other tips

Most probably this problem cannot be solved generically: where the graph of the cumulative distribution function is very flat (as it usually is towards the tails of the distribution), very high precision on the vertical axis is needed to reach reasonable precision on the horizontal axis.

Hence it will always be better to use a function that calculates the quantiles directly than to derive them from the cumulative distribution function.
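A direct quantile routine typically pairs a decent starting point with Newton's method, using the density f as the derivative of the CDF F. Below is a minimal sketch of that idea, using the closed-form CDF of Beta(2, 2) as a stand-in for a real incomplete-beta implementation; the class and method names are mine, not any library's:

```java
public class BetaQuantileNewton {

    // Closed-form CDF and PDF of Beta(2, 2) -- stand-ins, for
    // illustration only, for a general incomplete-beta implementation.
    static double cdf(double x) { return 3 * x * x - 2 * x * x * x; }
    static double pdf(double x) { return 6 * x * (1 - x); }

    // Newton iteration x <- x - (F(x) - p) / f(x), clamped to (0, 1).
    static double quantile(double p) {
        double x = 0.5;
        for (int i = 0; i < 100; i++) {
            double step = (cdf(x) - p) / pdf(x);
            x = Math.min(1 - 1e-12, Math.max(1e-12, x - step));
            if (Math.abs(step) < 1e-13) {
                break;
            }
        }
        return x;
    }

    public static void main(String[] args) {
        double q = quantile(0.025);
        System.out.println("2.5 percentile of Beta(2,2): " + q);
        System.out.println("round trip F(q): " + cdf(q));
    }
}
```

Near the root the convergence is quadratic, which is why direct quantile routines tend to be much faster than bisecting the CDF.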

If you do not worry about precision, you can, of course, solve the equation q = F(x) numerically. Since F is increasing, that is not difficult:

   double x_u = 0.0;
   double x_l = 0.0;

   // find some interval quantile is in
   if ( F (0.0) > q) {
      while ( F (x_l) > q) {
         x_u = x_l;
         x_l = 2.0 * x_l - 1.0;
      }
   } else {
      while ( F (x_u) < q) {
         x_l = x_u;
         x_u = 2.0 * x_u + 1.0;
      }
   }

   // narrow down the interval by bisection to the necessary precision
   while ( x_u - x_l > precision ) {
      double m = (x_u + x_l) / 2.0;
      if ( F (m) > q ) x_u = m; else x_l = m;
   }
   // quantile will be within [x_l; x_u]
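Made concrete in Java, the bracket-then-bisect scheme reads as follows; the logistic CDF stands in for F because its exact inverse, log(q / (1 - q)), is available for checking. Class and method names are mine, and 0 < q < 1 is assumed:

```java
public class QuantileByBisection {

    // Stand-in CDF: the logistic distribution, whose exact quantile
    // function is log(q / (1 - q)).
    static double F(double x) { return 1.0 / (1.0 + Math.exp(-x)); }

    // Assumes 0 < q < 1 and F strictly increasing.
    static double quantile(double q, double precision) {
        double xL = 0.0, xU = 0.0;
        // find some interval the quantile is in, doubling the step width
        if (F(0.0) > q) {
            while (F(xL) > q) { xU = xL; xL = 2.0 * xL - 1.0; }
        } else {
            while (F(xU) < q) { xL = xU; xU = 2.0 * xU + 1.0; }
        }
        // narrow the bracket down by bisection
        while (xU - xL > precision) {
            double m = (xU + xL) / 2.0;
            if (F(m) > q) { xU = m; } else { xL = m; }
        }
        return (xU + xL) / 2.0;
    }

    public static void main(String[] args) {
        System.out.println("bisection: " + quantile(0.025, 1e-12));
        System.out.println("exact:     " + Math.log(0.025 / 0.975));
    }
}
```

Each bisection step costs one CDF evaluation and gains one bit of precision, which is exactly why this approach becomes slow when high precision is required.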

Remark: It is not clear to me why precision should be a problem particularly for the beta distribution, since it lives on the interval [0;1] and its graph is rather steep towards the ends of that interval.

Second remark: Your calculation of the upper quantile is wrong; it should read

System.out.println( "97.5 percentile :" + betaDist.inverseCumulativeProbability( 1 - alpha / 2d ) );

Third edit: Algorithm corrected.

I have found and tried the library JSci (version 1.2, 27.07.2010).

Code snippet:

import JSci.maths.statistics.BetaDistribution;

final int trials = 162000;
final int successes = 10000;
final double alpha =0.05d;

BetaDistribution betaDist = new BetaDistribution(successes + 1, trials - successes + 1);
long timeSum = 0;
for(double perc : new double[]{alpha/2,0.5,1-alpha/2}){
    long time = System.currentTimeMillis();
    System.out.println((perc*100) + " percentile :" + betaDist.inverse(perc));
    timeSum += System.currentTimeMillis()-time;
}
System.out.println("Took ~" + timeSum/3 + "ms per call");

which returned

2.5 percentile :0.060561615036184686
50.0 percentile :0.06172659147924378
97.5 percentile :0.06290542466617127
Took ~2ms per call

Internally, a root-finding approach is used, as suggested by JohnB. One can extend ProbabilityDistribution#inverse to request more precision. Unfortunately, even with tons of iterations (100k) and a requested precision of 10^-10, the algorithm still returns

2.5 percentile :0.06056698485628473
50.0 percentile :0.06173200221779383
97.5 percentile :0.06291087598052053
Took ~564ms per call

Now: whose code is less wrong, R or JSci? I'd favor the one with the bigger user base ...
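One way to settle it without appealing to user bases: plug each candidate quantile into an independently computed beta CDF and see which one lands at alpha/2. Below is a self-contained sketch of the regularized incomplete beta function I_x(a, b), following the classic continued-fraction recipe (modified Lentz's method) with a Lanczos log-gamma, since the JDK provides neither. All names are mine, and for very large shape parameters the continued fraction may need many iterations, hence the generous iteration cap:

```java
public class IncompleteBeta {

    // Lanczos approximation of log Gamma(x), accurate to roughly 1e-10.
    static double lgamma(double xx) {
        double[] cof = {76.18009172947146, -86.50532032941677,
                24.01409824083091, -1.231739572450155,
                0.1208650973866179e-2, -0.5395239384953e-5};
        double y = xx;
        double tmp = xx + 5.5;
        tmp -= (xx + 0.5) * Math.log(tmp);
        double ser = 1.000000000190015;
        for (int j = 0; j < 6; j++) ser += cof[j] / ++y;
        return -tmp + Math.log(2.5066282746310005 * ser / xx);
    }

    // Continued fraction for the incomplete beta (modified Lentz's method).
    static double betacf(double a, double b, double x) {
        final int MAXIT = 1000;
        final double EPS = 3e-14, FPMIN = 1e-300;
        double qab = a + b, qap = a + 1.0, qam = a - 1.0;
        double c = 1.0, d = 1.0 - qab * x / qap;
        if (Math.abs(d) < FPMIN) d = FPMIN;
        d = 1.0 / d;
        double h = d;
        for (int m = 1; m <= MAXIT; m++) {
            int m2 = 2 * m;
            double aa = m * (b - m) * x / ((qam + m2) * (a + m2));
            d = 1.0 + aa * d; if (Math.abs(d) < FPMIN) d = FPMIN;
            c = 1.0 + aa / c; if (Math.abs(c) < FPMIN) c = FPMIN;
            d = 1.0 / d;
            h *= d * c;
            aa = -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2));
            d = 1.0 + aa * d; if (Math.abs(d) < FPMIN) d = FPMIN;
            c = 1.0 + aa / c; if (Math.abs(c) < FPMIN) c = FPMIN;
            d = 1.0 / d;
            double del = d * c;
            h *= del;
            if (Math.abs(del - 1.0) < EPS) break;
        }
        return h;
    }

    // Regularized incomplete beta I_x(a, b) = P(X <= x), X ~ Beta(a, b).
    static double betai(double a, double b, double x) {
        if (x <= 0.0) return 0.0;
        if (x >= 1.0) return 1.0;
        double bt = Math.exp(lgamma(a + b) - lgamma(a) - lgamma(b)
                + a * Math.log(x) + b * Math.log(1.0 - x));
        if (x < (a + 1.0) / (a + b + 2.0)) return bt * betacf(a, b, x) / a;
        return 1.0 - bt * betacf(b, a, 1.0 - x) / b;
    }

    public static void main(String[] args) {
        // Plugging R's 2.5% quantile back in should land near 0.025
        // if R is right (a = 10008, b = 151744 as in the question).
        System.out.println(betai(10008, 151744, 0.06070355));
    }
}
```

Evaluating betai at each library's 2.5% quantile then shows directly which implementation actually sits at cumulative probability 0.025.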

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow