qsort compare functions

Fri Nov 22 13:48:04 CET 2002

At 06:21 AM 11/22/02, Matthias Andree wrote:

>Object. This assumes we have a clear == semantic for double, which we
>don't really have. The EPS stuff is there to kill rounding errors.
>
>My code checks if d1->prob and d2->prob are close enough to each other
>to consider them equal.
>
>Remember: (1.0/3)*3 == 1.0 does not hold, and therefore < or > may match
>even though the values are "equal" in the rounding error constraints
>(EPS even).

Matthias,

Last night after changing _my_ compare routines to the code I want, I did 
two things.

First, I modified the qsort compare routines to do each comparison twice - 
the current way (with strcmp as the first check) and the new way (with 
strcmp as the last check) - and to exit(3) if the results differed.  I then 
ran my October corpus of 5,871 messages received through bogofilter.  No 
differences were encounterd.

Second, I looked up the value of EPS.  In globals.h, we have:

         #include <float.h> /* has DBL_EPSILON */
         #define EPS             (100.0 * DBL_EPSILON) /* equality cutoff */

and in <float.h> is:

         /* Difference between 1.0 and the minimum double greater than 1.0 */
         #undef DBL_EPSILON
         #define DBL_EPSILON 2.2204460492503131e-16

My first was one of alarm because I expected a small difference from 0.0, 
not from 1.0, say 1e-100 or some such value.  I was about to change _our_ 
definition, but decided to think first - which was a good idea.

If I combined my good and ham word lists, I'd have approx 500,000 words 
from approx 43,000 messages.  Given MAX_REPEATS of 4 for Graham, and 1 for 
Robinson, a token has a maximum value of 172,000, i.e. 43,000 * 4.  Two 
tokens will have either the same word counts or they won't.  With the max 
value of 172,000 two words with different counts will have probabilities 
that differ by at least 1/172,000, i.e. 5.81e-06.

This bit of math proved to me that two words with different counts will 
have distinctly different probabilities.  So the compare routines do not 
need an EPS test to catch a minute difference.  The EPS test for different 
probabilities will be effective when word lists get much larger - on the 
order of 1e+16.  We don't need to worry about that since that size exceeds 
the number of words in all languages (assuming approx 1,000,000 words per 
language times 1,000 languages).

So, with respect to qsort, we can do the simple thing - sort the tokens by 
probabilities and, if not different, sort by alphanumeric value.