tuning and archives

Stroller Linux.Luser at myrealbox.com
Mon Feb 23 17:20:06 CET 2004


On Feb 23, 2004, at 1:24 pm, Boris 'pi' Piwinger wrote:
>
> I really don't know if it matters that the tokens in the
> database are balanced in number. Balancing them in content
> is done with training to exhaustion. This would just let
> your problem vanish.

I appear to get relatively little spam. At least, my archive of ham is 
MUCH larger, though perhaps that's just because I've been more 
conscientious about storing it.

As a consequence, on my system bogofilter is trained on c. 6500 ham and 
c. 1650 spam, with cutoffs etc. at defaults.
To correct the imbalance (I reckon I do receive more spam than ham, so 
the spam count should catch up eventually), I periodically check all 
incoming messages by hand and train bogofilter on all of them, using 
the script I've proudly posted before:  
<http://article.gmane.org/gmane.mail.bogofilter.general/5935>.

But casual observation yesterday seemed to indicate that bogofilter 
only catches c. 95% of incoming spam. I had already noticed that 
bogofilter was not as accurate as might be anticipated (I think others 
expect 97% - 99%?) and had assumed this was due to the imbalance.
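
To put rough numbers on the above, here is a minimal Python sketch. The 
counts and percentages are only the approximate figures quoted in this 
message, and the function names are made up for illustration; nothing 
here calls bogofilter itself.

```python
# Approximate figures quoted in this message; not exact counts.
HAM_TRAINED = 6500
SPAM_TRAINED = 1650


def ham_to_spam_ratio(ham, spam):
    """Return how many ham messages were trained per spam message."""
    return ham / spam


def missed_per_thousand(catch_rate):
    """Return how many spam out of 1000 slip through at a given catch rate."""
    return round((1 - catch_rate) * 1000)


if __name__ == "__main__":
    ratio = ham_to_spam_ratio(HAM_TRAINED, SPAM_TRAINED)
    print(f"ham:spam training ratio is roughly {ratio:.1f}:1")
    # Compare the observed rate with the range others are thought to expect.
    for rate in (0.95, 0.97, 0.99):
        missed = missed_per_thousand(rate)
        print(f"{rate:.0%} caught: about {missed} of 1000 spam missed")
```

On these figures the training set holds nearly four ham for every spam, 
and the gap between 95% and 97% - 99% is 20 to 40 extra missed spam per 
thousand.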

Stroller.




