Floating point errors?

Ingomar Wesp wesp at inode.at
Sun Jul 8 18:49:33 CEST 2007


Hello there.

I recently discovered that my bogofilter setup (bogofilter 1.1.3 on GNU/Linux)
has stopped working properly. While classification was pretty good over the last
few years, bogofilter suddenly stopped detecting spam mails, even if they
contained a whole lot of bad tokens.

In order to figure out what’s wrong, I looked at the output 
of 'bogofilter -vvv' on a mail that was an obvious example of spam. This is 
an abbreviated version of what I saw:

| X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.1.3
|                                        n    pgood     pbad      fw     U

|  "A0ML2L"                              1       nan  0.000039       nan -
|  "Aktien"                             51       nan  0.001999       nan -
|  "Anlageempfehlung"                   25       nan  0.000980       nan -
|  "Aufforderung"                       27       nan  0.001058       nan -
|  "Boerse"                             26       nan  0.001019       nan -
|  "Chartsanalyse"                       1       nan  0.000039       nan -
|  "Der"                               331       nan  0.012972       nan -
|  "Die"                               537       nan  0.021045       nan -
|  "Diese"                             291       nan  0.011404       nan -
|  "Frankfurt"                         163       nan  0.006388       nan -
|  "Gesellschaft"                       63       nan  0.002469       nan -

|  "hat"                               664       inf  0.024650  0.000014 +
|  "oder"                              644       inf  0.015245  0.000014 +
|  "fuer"                              351       inf  0.012462  0.000026 +

It appears that proper classification fails due to a lack of floating point 
precision in the calculation of the numerical values for pgood and fw. A 
lookup of some of the tokens in my wordlist.db brings up the following:

| > bogoutil -w ~/.bogofilter Aktien Anlageempfehlung Frankfurt fuer
|                                  spam   good
| Aktien                             51      0
| Anlageempfehlung                   25      0
| Frankfurt                         163      0
| fuer                              318     33

From what I’ve figured out so far, the database does not appear to be broken.
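
For illustration only, here is a minimal Python sketch of one way the nan and inf values above could arise. It is a guess, not something confirmed by the output: it assumes pgood is computed as a token's good count divided by the total number of good messages, and that this total is 0. The spam total (25517) is an estimate inferred from the pbad column, and the fw formula with the constants ROBS and ROBX is an assumption chosen because it reproduces the fw column. The helper ieee_div is hypothetical and only mimics how C doubles divide under IEEE 754.

```python
import math

# Assumed values, none of them read from the real database:
SPAM_MSGS = 25517   # estimated from the pbad column, e.g. 51 / 0.001999
GOOD_MSGS = 0       # hypothesis: the good message count is zero
ROBS = 0.0178       # assumed Robinson s parameter
ROBX = 0.52         # assumed Robinson x parameter


def ieee_div(num: float, den: float) -> float:
    """Divide like a C double under IEEE 754 instead of raising an error.

    Assumes a positive zero denominator: x/0 gives +/-inf, 0/0 gives nan.
    """
    if den != 0.0:
        return num / den
    if num == 0.0 or math.isnan(num):
        return math.nan
    return math.copysign(math.inf, num)


def token_stats(spam_count: int, good_count: int):
    """Return (pgood, pbad, fw) for one token under the assumptions above."""
    pbad = ieee_div(spam_count, SPAM_MSGS)
    pgood = ieee_div(good_count, GOOD_MSGS)
    n = spam_count + good_count
    # With pgood = inf the ratio becomes 0.0; with pgood = nan it stays nan.
    p = ieee_div(pbad, pbad + pgood)
    fw = (ROBS * ROBX + n * p) / (ROBS + n)
    return pgood, pbad, fw


# Counts taken from the bogoutil output above.
for token, spam, good in [("Aktien", 51, 0), ("Frankfurt", 163, 0), ("fuer", 318, 33)]:
    pgood, pbad, fw = token_stats(spam, good)
    print(f"{token:<12} n={spam + good:<4} pgood={pgood} pbad={pbad:.6f} fw={fw:.6f}")
```

Under these assumptions, tokens with a good count of 0 end up with nan (0/0) and tokens with a nonzero good count end up with inf (x/0), which is the same pattern as in the verbose output. If that is really what happens, the message counts stored in the wordlist would be the first thing to check.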

Has anybody encountered similar behaviour on their setup? Are there any
known fixes? And if not, does anybody know the exact formula that could
produce these floating-point errors?

Any advice would be much appreciated, especially since manually sorting out
loads of spam is not a particularly entertaining task ;-)

So, thanks in advance and have a pleasant week,
Ingomar Wesp

-- 
 ____ )) _<http://ingomar.wesp.name/>_ .. ___________ ,^\\|//^. _______
(    ((          |     ~~           |  ||            //(-x-x-)\\      (
 ) (|~~| [#### ] | ====#    [###  ] | |~~| [#    ]   '|(,,^,,)|`       )
(__ '==' ________|__ (_.._)_________| |__| ___________ .\,,,/. ____iw_(


