qsort compare functions

Matthias Andree matthias.andree at gmx.de
Fri Nov 22 18:46:49 CET 2002


David Relson <relson at osagesoftware.com> writes:

> Last night after changing _my_ compare routines to the code I want, I
> did two things.
>
> First, I modified the qsort compare routines to do each comparison twice
> -
> the current way (with strcmp as the first check) and the new way (with
> strcmp as the last check) - and to exit(3) if the results differed.  I
> then ran my October corpus of 5,871 messages received through
> bogofilter.  No differences were encounterd.

So you have an example of it working either way on x86. That's not a
proof.

> If I combined my good and ham word lists, I'd have approx 500,000 words
> from approx 43,000 messages.  Given MAX_REPEATS of 4 for Graham, and 1
> for Robinson, a token has a maximum value of 172,000, i.e. 43,000 * 4.
> Two tokens will have either the same word counts or they won't.  With
> the max value of 172,000 two words with different counts will have
> probabilities that differ by at least 1/172,000, i.e. 5.81e-06.
>
> This bit of math proved to me that two words with different counts will
> have distinctly different probabilities.  So the compare routines do not
> need an EPS test to catch a minute difference.

Are you confident that all systems actually implement double as being 64
bit wide? In limits for ANSI-C (C89), we have the maximum DBL_EPSILON as
low as 1e-9. However, some systems have non-conforming double with
precision as low as float.

No-one has done research so far how far EPSILON and rounding
difficulties propagates, the point is: we have headroom up to 1e-7 on
these machines. That's razor sharp already. If any critical
additions/subtractions take place, the whole calculation is hosed.

> So, with respect to qsort, we can do the simple thing - sort the tokens
> by probabilities and, if not different, sort by alphanumeric value.

I'd prefer correctness over saving two lines of code. We need to be
portable, and on systems that do not support a native representation for
"exactly 0.0", we may catch nasty surprises.

-- 
Matthias Andree



More information about the bogofilter-dev mailing list