tuning and archives
Stroller
Linux.Luser at myrealbox.com
Mon Feb 23 17:20:06 CET 2004
On Feb 23, 2004, at 1:24 pm, Boris 'pi' Piwinger wrote:
>
> I really don't know if it matters that the tokens in the
> database are balanced in number. Balancing them in content
> is done with training to exhaustion. This would just let
> your problem vanish.
I appear to get relatively little spam. Or rather, my archive of ham is
MUCH larger, but perhaps that's just because I've been more
conscientious about storing it.
As a consequence, on my system bogofilter is trained on c. 6500 ham and
c. 1650 spam, with cutoffs etc. at defaults.
To correct the imbalance, I periodically check all incoming messages by
hand and train bogofilter on every one of them (using the script I've
proudly posted before:
<http://article.gmane.org/gmane.mail.bogofilter.general/5935>).
My reasoning is that I do get more spam than ham, so the spam count
will catch up eventually.
But casual observation yesterday seemed to indicate that bogofilter
only catches c. 95% of incoming spam. I had already noticed that
bogofilter was not as accurate as might be anticipated (I think others
expect 97% - 99%?) and had assumed this was due to the imbalance.
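To put rough numbers on that, here is a small Python sketch of the arithmetic. It uses only the approximate figures quoted above; the variable names and the helper missed_per_1000 are made up for illustration and have nothing to do with bogofilter itself.

```python
# Approximate training counts quoted in this message.
ham_trained = 6500
spam_trained = 1650

# Ham messages in the training set per spam message.
ratio = ham_trained / spam_trained
print(f"ham:spam training ratio is about {ratio:.1f}:1")


def missed_per_1000(catch_rate):
    """Spam messages slipping through per 1000 spam received (hypothetical helper)."""
    return round(1000 * (1 - catch_rate))


# Compare the observed catch rate with the range others reportedly expect.
for rate in (0.95, 0.97, 0.99):
    print(f"{rate:.0%} caught: {missed_per_1000(rate)} missed per 1000 spam")
```

So the gap between 95% and 99% is the difference between roughly 50 and roughly 10 spam getting through per thousand, which is why it is noticeable even from casual observation.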
Stroller.