Maintaining a snappy bogofilter
Greg Louis
glouis at dynamicro.on.ca
Fri Apr 11 00:47:33 CEST 2003
On 20030410 (Thu) at 0900:54 -0400, David Relson wrote:
> At 08:41 AM 4/10/03, Chris Ditri wrote:
> >I was wondering what people do to keep their goodlist and spamlist
> >databases fast and trim. Do they need to be rebuilt from time to time
> >or somehow "defragged"?
>
> My spamlist currently has 80,413 words and 11,306 messages and my goodlist
> has 235,043 words and 29,736 messages. Performance seems fine and I don't
> do anything to keep it fast and trim.
>
> If I _were_ to do something, I'd use the maintenance capabilities in
> bogoutil. Two capabilities in particular come to mind. The first is the
> ability to delete all hapaxes, i.e. words occurring only once in the
> corpus. The second is the ability to delete all words older than a certain
> age.
>
> The ability is there, but I don't know at what point it becomes worth
> using.
Me neither, but I think this observation is of potential interest: when
the rauss-hapax feature was implemented, I used it (i.e. deleted all
tokens from the training db that had a count of 1), and my percentage
of false negatives promptly doubled! At that time I had something like
a half million tokens in the spamlist, from fifteen thousand spams, and
about 300 thousand tokens in the goodlist, from some 9500 nonspams.
Needless to say, I reverted to the bloated training db in something of
a hurry.
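
To make the two maintenance operations discussed above concrete, here is
a minimal sketch in Python. It is not bogoutil's code and does not touch
a real wordlist: the in-memory dictionary layout, the function names
drop_hapaxes and drop_older_than, and the sample tokens are all
assumptions made up for illustration.

```python
from typing import Dict, Tuple

# Hypothetical in-memory wordlist: token -> (count, last_seen as YYYYMMDD).
# A real bogofilter wordlist is a database file; this dict only mimics the idea.
Wordlist = Dict[str, Tuple[int, int]]


def drop_hapaxes(words: Wordlist, max_count: int = 1) -> Wordlist:
    """Return a copy without tokens whose count is <= max_count."""
    return {tok: (cnt, seen) for tok, (cnt, seen) in words.items()
            if cnt > max_count}


def drop_older_than(words: Wordlist, cutoff: int) -> Wordlist:
    """Return a copy without tokens last seen before cutoff (YYYYMMDD)."""
    return {tok: (cnt, seen) for tok, (cnt, seen) in words.items()
            if seen >= cutoff}


# Made-up sample data, not taken from anyone's actual spamlist.
spamlist: Wordlist = {
    "viagra": (412, 20030409),
    "mortgage": (97, 20030401),
    "xq7zt": (1, 20030115),
    "v1agra": (1, 20030408),
}

trimmed = drop_hapaxes(spamlist)               # count-based pruning
recent = drop_older_than(spamlist, 20030301)   # age-based pruning

print(sorted(trimmed))
print(sorted(recent))
```

In this toy data, hapax deletion throws away "v1agra", a token seen only
once but only a few days ago, while age-based pruning keeps it. Whether
losing rare tokens of that kind accounts for the doubled false negatives
is a guess, not something measured here.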
--
| G r e g L o u i s | gpg public key: finger |
| http://www.bgl.nu/~glouis | glouis at consultronics.com |
| http://wecanstopspam.org in signatures fights junk email |