Maintaining a snappy bogofilter

Greg Louis glouis at dynamicro.on.ca
Fri Apr 11 00:47:33 CEST 2003


On 20030410 (Thu) at 0900:54 -0400, David Relson wrote:
> At 08:41 AM 4/10/03, Chris Ditri wrote:

> >I was wondering what people do to keep their goodlist and spamlist
> >databases fast and trim.  Do they need to be rebuilt from time to time
> >or somehow "defragged"?
> 
> My spamlist currently has 80,413 words and 11,306 messages and my goodlist 
> has 235,043 words and 29,736 messages.  Performance seems fine and I don't 
> do anything to keep it fast and trim.
> 
> If I _were_ to do something, I'd use the maintenance capabilities in 
> bogoutil.  Two capabilities in particular come to mind.  The first is the 
> ability to delete all hapaxes, i.e. words occurring only once in the 
> corpus.  The second is the ability to delete all words older than a certain 
> age.
> 
> The ability is there and I don't know at what point it becomes of value to 
> use it.

Me neither, but I think this observation is of potential interest: when
the rauss-hapax feature was implemented, I used it (i.e. deleted all
tokens from the training db that had a count of 1) and my percentage
of false negatives promptly doubled!  At that time I had something like
a half million tokens in the spamlist, from fifteen thousand spams, and
about 300 thousand tokens in the goodlist, from some 9500 nonspams. 
Needless to say, I reverted to the bloated training db in something of
a hurry.
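
For anyone who wants to see what the two pruning rules discussed above
amount to (dropping hapaxes, dropping tokens older than some age), here
is a minimal Python sketch.  It works on a plain in-memory dict and is
not bogoutil's implementation or file format; the prune() function, its
parameters and the sample tokens are all made up for illustration.

```python
from datetime import date, timedelta

# Hypothetical stand-in for a wordlist: token -> (count, last_seen).
# This is NOT bogofilter's on-disk format, just an illustration.


def prune(wordlist, min_count=2, max_age_days=None, today=None):
    """Return a copy of wordlist without rare or stale tokens.

    Tokens with count < min_count are dropped (min_count=2 removes
    hapaxes, i.e. tokens seen exactly once).  If max_age_days is given,
    tokens last seen more than that many days before `today` are
    dropped as well.
    """
    today = today or date.today()
    kept = {}
    for token, (count, last_seen) in wordlist.items():
        if count < min_count:
            continue  # too rare
        if max_age_days is not None:
            if today - last_seen > timedelta(days=max_age_days):
                continue  # too old
        kept[token] = (count, last_seen)
    return kept


# Made-up sample data.
spamlist = {
    "viagra": (42, date(2003, 4, 9)),
    "xq7zt": (1, date(2003, 4, 1)),
    "refinance": (3, date(2002, 1, 15)),
}

no_hapaxes = prune(spamlist)
recent_only = prune(spamlist, max_age_days=180, today=date(2003, 4, 10))
```

The function returns a new dict and leaves the original untouched, so
the unpruned list is still there to fall back on if accuracy suffers.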

-- 
| G r e g  L o u i s          | gpg public key: finger     |
|   http://www.bgl.nu/~glouis |   glouis at consultronics.com |
| http://wecanstopspam.org in signatures fights junk email |



