junk test

David Relson relson at osagesoftware.com
Wed May 28 21:58:04 CEST 2003


At 03:30 PM 5/28/03, John McCain wrote:
>Fascinating.  I was able to confirm this by creating a test message and then
>adding a number of junk tokens to it.  The spamicity score was unchanged.
>
>If I understand this situation correctly, then it would be possible to wash
>out all single tokens in the database with absolutely no impact on accuracy,
>assuming that no statistically significant token would repeat itself in X
>period of time.  Does this sound reasonable?

John,

You _could_ do that.

Greg tried it and found the results highly unsatisfactory.  He uses "train 
on error" rather than "-u" (autoupdate), so his wordlists contain lots of 
hapaxes (tokens seen only once), and each of them is there for good reason.  
Removing singletons was therefore very bad for his environment.

If you're using autoupdate, your results may well be different.  You could 
try removing old singletons.  If bogofilter continues to work well, you 
could then try removing them all.
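
A minimal sketch of the "old singletons first" step, in Python.  It assumes 
you have a text dump of a wordlist with one "token count yyyymmdd" entry per 
line; that layout, the function names and the sample data are all 
assumptions for illustration, so check them against what your own dump 
actually looks like before trusting the result.

```python
from datetime import date, datetime


def parse_line(line):
    """Split one dump line into (token, count, last_used).

    Assumed layout: "token count yyyymmdd", whitespace separated.
    """
    token, count, stamp = line.rsplit(None, 2)
    return token, int(count), datetime.strptime(stamp, "%Y%m%d").date()


def prune_old_singletons(lines, cutoff):
    """Drop entries with count 1 that were last used before `cutoff`.

    Everything else, including recent singletons, is kept unchanged.
    """
    kept = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        _token, count, last_used = parse_line(line)
        if count == 1 and last_used < cutoff:
            continue  # old singleton: candidate for removal
        kept.append(line)
    return kept


# Hypothetical sample entries, not real wordlist data.
sample = [
    "viagra 42 20030520",
    "xqzzt 1 20030102",
    "meeting 1 20030527",
]
pruned = prune_old_singletons(sample, date(2003, 4, 1))
for entry in pruned:
    print(entry)
```

Work on a copy of the wordlist, rebuild from the pruned dump, and rerun your 
test messages before deciding whether to go further.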

FWIW, I'm letting bogofilter autoupdate (with manual supervision).  My 
goodlist is presently 16MB and my spamlist is 5.5MB.  I've not seen a need 
to do any pruning.  An interesting experiment would be to graph hapaxes by 
date-last-used ...
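
A rough sketch of that experiment, as a text histogram rather than a real 
graph.  As before, the "token count yyyymmdd" line layout, the function name 
and the sample entries are assumptions, not anything bogofilter guarantees.

```python
from collections import Counter


def hapax_histogram(lines):
    """Count singletons (count == 1) per month of last use.

    Assumed layout per line: "token count yyyymmdd".
    Returns a dict mapping "yyyymm" to the number of hapaxes, sorted by month.
    """
    per_month = Counter()
    for line in lines:
        parts = line.rsplit(None, 2)
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        _token, count, stamp = parts
        if int(count) == 1:
            per_month[stamp[:6]] += 1
    return dict(sorted(per_month.items()))


# Hypothetical sample entries, not real wordlist data.
sample = [
    "alpha 1 20030102",
    "beta 1 20030115",
    "gamma 7 20030120",
    "delta 1 20030527",
]
histogram = hapax_histogram(sample)
for month, n in histogram.items():
    print(month, "#" * n)
```

If most hapaxes cluster in the oldest months, that would suggest an age 
cutoff removes a lot of them; if they are spread evenly, pruning by age 
buys less.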

David





