Floating point errors?

David Relson relson at osagesoftware.com
Sun Jul 15 02:24:07 CEST 2007


On Sun, 8 Jul 2007 18:49:33 +0200
Ingomar Wesp wrote:

> Hello there.
> 
> I recently discovered that my bogofilter setup (bogofilter 1.1.3 on
> GNU/Linux) stopped working properly. While classification was pretty
> good over the last few years, bogofilter suddenly stopped detecting
> spam-mails - even if they contained a whole lot of bad tokens.
> 
> In order to figure out what’s wrong, I looked at the output 
> of 'bogofilter -vvv' on a mail that was an obvious example of spam.
> This is an abbreviated version of what I saw:

...[snip]...
 
> It appears that proper classification fails due to a lack of floating
> point precision in the calculation of the numerical values for pgood
> and fw. A lookup of some of the tokens in my wordlist.db brings up
> the following:
> 
> | > bogoutil -w ~/.bogofilter Aktien Anlageempfehlung Frankfurt fuer
> |                                  spam   good
> | Aktien                             51      0
> | Anlageempfehlung                   25      0
> | Frankfurt                         163      0
> | fuer                              318     33
> 
> From what I’ve figured out so far, the database does not appear to be
> broken.
> 
> Has anybody encountered a similar behaviour on his or her setup? Are
> there any known fixes? And if not, does anybody know the exact
> formula that could bring up these floating point errors?
> 
> Any advice would be much appreciated. Especially since manually
> sorting out loads of spam is not a particularly entertaining task ;-)
> 
> So, thanks in advance and have a pleasant week,
> Ingomar Wesp

Hello Ingomar,

Sorry for the delayed response; I just got back from a week's
vacation.

Running "bogoutil -w" was the right idea, but running it slightly
differently would be even better!  

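First, use the "-p" option instead of "-w"; it prints the spam and good
_counts_ and also the spam score for each token. Second, add the token
".MSG_COUNT" (with the dot but not the quotes) to the command line, so
you can see how many spam and good messages the wordlist has recorded.
With your tokens, that would be something like:

  bogoutil -p ~/.bogofilter .MSG_COUNT Aktien Anlageempfehlung Frankfurt fuer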

As a quick test, I built two wordlists using your four tokens. The
first list had .MSG_COUNT set to 0 (for both spam and ham) and produced
"nan" results. The second list had .MSG_COUNT set to 400 and produced
normal results.
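
To illustrate why a zero .MSG_COUNT could lead to "nan", here is a rough
Python sketch. It is not taken from the bogofilter source: the
Robinson-style formula, the s and x values, and the helper names
ieee_div and token_score are all assumptions made for illustration.
Python normally raises an error on division by zero, so ieee_div mimics
what a C double would do instead. The 400/400 message counts are also
an assumption about how the second test list was set up.

```python
import math


def ieee_div(a, b):
    """Divide the way a C double would: x/0 gives inf, 0/0 gives nan."""
    if b == 0:
        if a == 0 or math.isnan(a):
            return math.nan
        return math.copysign(math.inf, a)
    return a / b


def token_score(spam, good, spam_msgs, good_msgs, s=1.0, x=0.5):
    """Robinson-style f(w) for one token.

    The formula and the defaults for s and x are assumptions for this
    sketch, not bogofilter's actual code or configuration.
    """
    bad_rate = ieee_div(spam, spam_msgs)    # share of spam messages with the token
    good_rate = ieee_div(good, good_msgs)   # share of good messages with the token
    p = ieee_div(bad_rate, bad_rate + good_rate)
    n = spam + good
    return (s * x + n * p) / (s + n)


tokens = {
    "Aktien": (51, 0),
    "Anlageempfehlung": (25, 0),
    "Frankfurt": (163, 0),
    "fuer": (318, 33),
}

# Compare a zeroed .MSG_COUNT against a plausible one (400 is assumed for both).
for spam_msgs, good_msgs in ((0, 0), (400, 400)):
    print(f".MSG_COUNT spam={spam_msgs} good={good_msgs}")
    for name, (spam, good) in tokens.items():
        print(f"  {name:<20} {token_score(spam, good, spam_msgs, good_msgs)}")
```

With both message counts at zero, every per-message rate becomes inf or
nan, and the nan propagates into the final score. With non-zero counts
the same token counts give ordinary scores between 0 and 1.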

Offhand, it sounds like your wordlist may have become corrupted, for
example with a missing or zeroed .MSG_COUNT entry. Which database
backend and which distribution are you using?

Regards,

David


