qsort compare functions

Fri Nov 22 19:00:42 CET 2002

<x-flowed>

At 12:46 PM 11/22/02, Matthias Andree wrote:

>So you have an example of it working either way on x86. That's not a
>proof.

True.  It's a demonstration

> > If I combined my good and ham word lists, I'd have approx 500,000 words
> > from approx 43,000 messages.  Given MAX_REPEATS of 4 for Graham, and 1
> > for Robinson, a token has a maximum value of 172,000, i.e. 43,000 * 4.
> > Two tokens will have either the same word counts or they won't.  With
> > the max value of 172,000 two words with different counts will have
> > probabilities that differ by at least 1/172,000, i.e. 5.81e-06.
> >
> > This bit of math proved to me that two words with different counts will
> > have distinctly different probabilities.  So the compare routines do not
> > need an EPS test to catch a minute difference.
>
>Are you confident that all systems actually implement double as being 64
>bit wide? In limits for ANSI-C (C89), we have the maximum DBL_EPSILON as
>low as 1e-9. However, some systems have non-conforming double with
>precision as low as float.
>
>No-one has done research so far how far EPSILON and rounding
>difficulties propagates, the point is: we have headroom up to 1e-7 on
>these machines. That's razor sharp already. If any critical
>additions/subtractions take place, the whole calculation is hosed.
>
> > So, with respect to qsort, we can do the simple thing - sort the tokens
> > by probabilities and, if not different, sort by alphanumeric value.
>
>I'd prefer correctness over saving two lines of code. We need to be
>portable, and on systems that do not support a native representation for
>"exactly 0.0", we may catch nasty surprises.

I'm not advocating removing the epsilon checks, i.e. "fabs(a-b) < 
EPS".  I'm asserting that the sort routines, whose purpose is to generate 
pretty (ordered) output for people to read, don't need to worry about 
EPS.  bogofilter's computations are not affected by the order.

My math proved to my satisfaction that, for reasonably sized word lists 
(less than tens of millions of tokens), if two tokens have the same 
probability, then they have the same counts in spamlist.db and goodlist.db. 
Rounding isn't an issue for the token probabilities.

I'm not worried about encountering a system that can't handle the precision 
needed for sorting.
</x-flowed>