junk test
David Relson
relson at osagesoftware.com
Wed May 28 21:58:04 CEST 2003
At 03:30 PM 5/28/03, John McCain wrote:
>Fascinating. I was able to confirm this by creating a test message and then
>adding a number of junk tokens to it. The spamicity score was unchanged.
>
>If I understand this situation correctly, then it would be possible to wash
>out all single tokens in the database with absolutely no impact on accuracy,
>assuming that no statistically significant token would repeat itself in X
>period of time. Does this sound reasonable?
John,
You _could_ do that.
Greg tried it and found the results highly unsatisfactory. He uses "train
on error" rather than "-u" (autoupdate), so his wordlists hold lots of
hapaxes (tokens seen only once). Every one of them is there for good
reason: it came from a message bogofilter had misclassified. So, removing
singletons was very bad for his environment.
If you're using autoupdate, your results may well be different. You could
try removing old singletons. If bogofilter continues to work well, you
could then try removing them all.
FWIW, I'm letting bogofilter autoupdate (with manual supervision). My
goodlist is presently 16MB and my spamlist is 5.5MB. I've not seen a need
to do any pruning. An interesting experiment would be to graph hapaxes by
date-last-used ...
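A quick way to try that experiment, sketched in Python. As before, it assumes a text dump with "token count yyyymmdd" lines, which is an assumption about the dump layout; the function names are made up for the example. It counts hapaxes per month of last use and draws a crude text bar chart.

```python
from collections import Counter


def hapax_histogram(lines):
    """Count tokens with count == 1, grouped by yyyymm of last use."""
    counts = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) < 3:
            continue  # not a 'token count yyyymmdd' line, skip it
        count, stamp = parts[-2], parts[-1]
        if count == "1":
            counts[stamp[:6]] += 1
    return dict(sorted(counts.items()))


def render_bars(hist, width=40):
    """Turn the histogram into text rows, scaled to the busiest month."""
    peak = max(hist.values(), default=0)
    rows = []
    for month, n in hist.items():
        bar = "#" * max(1, round(n * width / peak))
        rows.append(f"{month} {bar} {n}")
    return rows


if __name__ == "__main__":
    sample = [
        "viagra 12 20030520",
        "xqzjunk 1 20030101",
        "rareword 1 20030527",
        "oddtoken 1 20030502",
    ]
    for row in render_bars(hapax_histogram(sample)):
        print(row)
```

If most hapaxes cluster in the oldest months, age-based pruning would remove a lot; if they are spread evenly, it would not buy much.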
David