custom cdfchi implemented - do we need GSL? (was: HEADS-UP: dcdflib migration to GSL.)

Matthias Andree matthias.andree at gmx.de
Mon Jul 21 03:18:35 CEST 2003


Matthias Andree <matthias.andree at gmx.de> writes:

> Numerical Recipes in C has quite compact stuff for the \chi^2
> (cumulative) distribution function, but their code "cannot be
> redistributed", and apparently they want me to buy a license to use that
> code (which - I think - isn't possible because it equals limiting the
> consumer's rights after the purchase, such clauses are automatically
> void by German debt legislation, if it's at all possible to claim
> copyrights for implementing an algorithm found many decades ago by a
> mathematician.)

Well, I've looked at some of that stuff a bit closer. We only need a
special case of the cdfchi function for our purposes, because the
"degrees of freedom" is always an even positive integer or zero (0, 2,
4, 6...). This allows some algorithmic transformations. I don't know if
this is faster than integrating things, but at least it's only a
screenful of lines :-)

http://www.itl.nist.gov/div898/handbook/eda/section3/eda3666.htm

This mentions F(x) = \frac{\gamma(\nu/2, x/2)}{\Gamma(\nu/2)} where F(x)
is the cumulative distribution function of the chi^2 distribution.

We observe in fisher.c:fis_get_spamicity that df will always be 2 *
robn, with robn being a natural number. Therefore, our Gamma function
(complete and incomplete) will always have an integer argument.

F(x) = \frac{\gamma(robn, x/2)}{\Gamma(robn)}

     = \frac{\gamma(robn, x/2)}{(n-1)!}

http://mathworld.wolfram.com/IncompleteGammaFunction.html

This site has a nice-looking formula for \gamma(n,x):

                     /                                       \
\gamma(n,x) = (n-1)! |1 - exp(-x) sum_(k_0)^(n-1) [ x^k/k! ] |
                     \                                       /

As we want Q rather than P, we can cancel down the (n-1)! and all that
remains is

Q(2x, 2n) = exp(-x) sum_(k_0)^(n-1) [ x^k/k! ]

This is easy to implement, but requires the usual "keep logarithmised
values" trick: for large x, I cannot multiply with exp(-x) later because
this term becomes 0 with machine precision.

The code is below. I tested this stuff on approximately 3,400 mails,
that is, 6,752 cdfchi calls. I also stuffed the mailbox as a single
mail to see how the function deals with 2e5 for degrees of freedom. It's
not exactly fast in that case.

The absolute and relative error is between 0 and 2.6e-11. There are
occasions when dcdflib's cdfchi returns 0 - I believe it cheats and
special cases something like 1e-46 <- 0.

I wonder if we need to benchmark this stuff, because the DF can be quite
large, leading to (typically) hundreds to thousands of iterations
involving exp() and log(). I don't currently see a way to save one of
these calls, any other approach I've tried hasn't been numerically
stable.

There may be cases where approximating integrals is quicker,
particularly for large degrees of freedom... (but maybe there's a way to
approximate the sum we're having as well, dunno, haven't looked yet),
but I wonder if it's worth the effort because we have the very expensive
lexer as well, and one log + exp per token extra looks acceptable to me.

There is probably also a way to approximate the sum, but we can look
into that if this code turns out to be too slow. Not recommended with
software floating point emulation. ;-)

Anyways, here's the fodder, *please comment*. (not in CVS yet)

/* calculate complement of cumulative distribution function of chi^2
 * distribution, 1-P(x,nu). This function expects nu to be an even
 * number so that our incomplete Gamma function sum works! */

double macompcdfchi2(double x, int nu) {
    int k;
    /* halve arguments */
    int robn = nu / 2;
    double gx = x / 2;
    double lngx = log(gx);
    double lnqi = -gx;
    double s = exp(-gx);

    if (nu % 2) {
	abort();
    }

    for (k = 1; k <= robn-1; k++) {
	lnqi += lngx - log(k);
	s += exp(lnqi);
    }
    return s;
}



My current prbf looks like this:

double prbf(double x, double df)
{
    int which=1;
    double p, q;
    double q2;
    int status;
    double bound;

    /* pass in x, df; want p and q; return q */
    cdfchi(&which, &p, &q, &x, &df, &status, &bound);

    q2 = macompcdfchi2(x, df);

    fprintf(stderr, "x=%g, df=%g => prbf: %g, mycdf: %g, abs: %g, rel: %g%%\n",x,df,q, q2,
            fabs(q-q2), fabs(q-q2)/(max(fabs(q), fabs(q2)))*100);

    return(status==0 ? q : 1.0);
}

-- 
Matthias Andree




More information about the bogofilter-dev mailing list