qsort compare functions
relson at osagesoftware.com
Fri Nov 22 07:48:04 EST 2002
At 06:21 AM 11/22/02, Matthias Andree wrote:
>Object. This assumes we have a clear == semantic for double, which we
>don't really have. The EPS stuff is there to kill rounding errors.
>My code checks if d1->prob and d2->prob are close enough to each other
>to consider them equal.
>Remember: (1.0/3)*3 == 1.0 does not hold, and therefore < or > may match
>even though the values are "equal" in the rounding error constraints
Last night, after changing _my_ compare routines to the code I want, I did
two things.
First, I modified the qsort compare routines to do each comparison twice -
the current way (with strcmp as the first check) and the new way (with
strcmp as the last check) - and to exit(3) if the results differed. I then
ran my October corpus of 5,871 messages received through bogofilter. No
differences were encountered.
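The dual-comparison harness looked roughly like this - a sketch only; the
struct and function names are mine, not the actual bogofilter source:

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical token record; field names are illustrative only. */
typedef struct { const char *text; double prob; } tok_t;

/* Current ordering: strcmp as the first check. */
static int cmp_old(const void *pa, const void *pb)
{
    const tok_t *a = pa, *b = pb;
    int r = strcmp(a->text, b->text);
    if (r) return r;
    return (a->prob > b->prob) - (a->prob < b->prob);
}

/* New ordering: probability first, strcmp as the last check. */
static int cmp_new(const void *pa, const void *pb)
{
    const tok_t *a = pa, *b = pb;
    int r = (a->prob > b->prob) - (a->prob < b->prob);
    if (r) return r;
    return strcmp(a->text, b->text);
}

static int sign(int x) { return (x > 0) - (x < 0); }

/* Cross-check wrapper: run both routines on every comparison and
   exit(3) the moment their orderings disagree. */
static int cmp_checked(const void *a, const void *b)
{
    int r_old = cmp_old(a, b), r_new = cmp_new(a, b);
    if (sign(r_old) != sign(r_new))
        exit(3);
    return r_new;
}
```

Passing cmp_checked to qsort() then sorts the whole corpus while
verifying agreement on every pairwise comparison.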
Second, I looked up the value of EPS. In globals.h, we have:
#include <float.h> /* has DBL_EPSILON */
#define EPS (100.0 * DBL_EPSILON) /* equality cutoff */
and in <float.h> is:
/* Difference between 1.0 and the minimum double greater than 1.0 */
#define DBL_EPSILON 2.2204460492503131e-16
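In code terms, the cutoff amounts to an equality test like this (the helper
name is mine; the real compare routines inline this kind of check):

```c
#include <float.h>
#include <math.h>

#define EPS (100.0 * DBL_EPSILON)  /* equality cutoff, as in globals.h */

/* Treat two probabilities as equal when they differ by less than EPS. */
static int close_enough(double d1, double d2)
{
    return fabs(d1 - d2) < EPS;
}
```

For example, close_enough(0.1 + 0.2, 0.3) holds even though
0.1 + 0.2 != 0.3 in doubles - that is the rounding error EPS is there
to kill.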
My first reaction was one of alarm because I expected a small difference from 0.0,
not from 1.0, say 1e-100 or some such value. I was about to change _our_
definition, but decided to think first - which was a good idea.
If I combined my spam and ham word lists, I'd have approx 500,000 words
from approx 43,000 messages. Given MAX_REPEATS of 4 for Graham, and 1 for
Robinson, a token has a maximum count of 172,000, i.e. 43,000 * 4. Two
tokens will have either the same word counts or they won't. With the max
value of 172,000 two words with different counts will have probabilities
that differ by at least 1/172,000, i.e. 5.81e-06.
This bit of math proved to me that two words with different counts will
have distinctly different probabilities. So the compare routines do not
need an EPS test to catch a minute difference. The EPS test for different
probabilities will be effective when word lists get much larger - on the
order of 1e+16. We don't need to worry about that since that size exceeds
the number of words in all languages (assuming approx 1,000,000 words per
language times 1,000 languages).
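The arithmetic can be checked mechanically - a standalone sketch using the
numbers above, nothing from the actual source:

```c
#include <float.h>

/* The numbers from the argument above. */
#define EPS       (100.0 * DBL_EPSILON)  /* ~2.22e-14 */
#define MAX_COUNT (43000.0 * 4.0)        /* 43,000 messages * MAX_REPEATS 4 */
#define MIN_DIFF  (1.0 / MAX_COUNT)      /* smallest count-driven gap: ~5.81e-06 */

/* MIN_DIFF exceeds EPS by roughly eight orders of magnitude, so at this
   corpus size the EPS test can never mask a genuine count difference. */
```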
So, with respect to qsort, we can do the simple thing - sort the tokens by
probabilities and, if not different, sort by alphanumeric value.
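That simple compare might look like this - again a sketch, with illustrative
struct and field names rather than the actual bogofilter code:

```c
#include <string.h>

/* Hypothetical token record for illustration. */
typedef struct {
    const char *text;  /* the token itself */
    double      prob;  /* its probability */
} word_t;

/* Sort by probability; fall back to the token text only when the
   probabilities are identical. */
static int compare_words(const void *pa, const void *pb)
{
    const word_t *a = pa;
    const word_t *b = pb;

    if (a->prob < b->prob) return -1;
    if (a->prob > b->prob) return  1;
    return strcmp(a->text, b->text);
}
```

and then: qsort(words, nwords, sizeof(word_t), compare_words);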