wordlist.db problem

Tom Anderson tanderso at oac-design.com
Fri Jun 18 14:03:23 CEST 2004


On Fri, 2004-06-18 at 06:02, Peter Bishop wrote:
> I have never used -u, because there is a risk of incorrectly updating 
> the database. Also, it can make your database very large very quickly,
> and this can slow the response to new types of spam.

You can avert this risk by simply making sure to register anything that
bogofilter gets wrong.  With -u you still train on error, but the
messages it gets right are also registered automatically.  The benefit
is that you have a much richer set of tokens with which to classify.
This is especially true for hams, which for me always score under 0.15
now.  Without -u, it's much harder to classify emails accurately, since
the wordlist often has too few tokens to go on.
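
To make the "register anything that bogofilter gets wrong" step
concrete, here is a rough Python sketch of what a correction does to the
counts.  It is a toy model, not bogofilter's actual storage or code:
the ToyWordlist class and its method names are made up for
illustration, and counting each token once per message is an assumption
of the sketch.

```python
from collections import Counter


class ToyWordlist:
    """Toy stand-in for a token database (hypothetical, not bogofilter's format)."""

    def __init__(self):
        self.spam = Counter()
        self.ham = Counter()

    def register(self, tokens, is_spam):
        # Assumption: each distinct token is counted once per message.
        (self.spam if is_spam else self.ham).update(set(tokens))

    def unregister(self, tokens, is_spam):
        counts = self.spam if is_spam else self.ham
        counts.subtract(set(tokens))
        # Drop tokens whose count fell to zero so the list does not keep clutter.
        for tok in [t for t, c in counts.items() if c <= 0]:
            del counts[tok]

    def correct(self, tokens, was_spam):
        # Undo the wrong automatic registration, then register on the right side.
        self.unregister(tokens, was_spam)
        self.register(tokens, not was_spam)


wl = ToyWordlist()
msg = "cheap pills online".split()
wl.register(msg, is_spam=False)  # auto-update filed a spam as ham (an error)
wl.correct(msg, was_spam=False)  # the user reports the mistake
print(dict(wl.spam), dict(wl.ham))
```

The point is only that a correction has to remove the wrong
registration as well as add the right one; otherwise the mistaken
counts stay in the wordlist.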

As for the size, it's not really that bad.  My db is around 25M.  It
grows more quickly at first, but once you have thousands of tokens in
there, the incidence of hapaxes (tokens seen only once) drops and the
growth is much slower: for the most part, only the counts are
incremented.  Those little electronic Oxford multilingual dictionaries
only have about 16M of storage, so there's a limit to how far a
wordlist of real words can grow.  Yeah, I know, if you stored every
permutation of all ASCII characters up to 30 long, you'd have far more
than terabytes, but spammers apparently aren't clever enough to fill
their spams with that level of randomness.  And if they ever become so,
we'll just develop good methods of removing the clutter.
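
The slowing growth can be illustrated with a small simulation.  This is
only a sketch under stated assumptions: token frequencies are taken to
follow a Zipf-like 1/k distribution, and the vocabulary size, batch
size, and function name are arbitrary choices of mine, not measurements
from a real wordlist.

```python
import random
from itertools import accumulate


def new_tokens_per_batch(vocab_size=5000, batches=10,
                         tokens_per_batch=2000, seed=0):
    """Count how many previously unseen tokens each batch of mail adds."""
    rng = random.Random(seed)
    # Assumption: the k-th most common token has weight 1/k (Zipf-like).
    cum_weights = list(accumulate(1.0 / k for k in range(1, vocab_size + 1)))
    seen = set()
    growth = []
    for _ in range(batches):
        batch = rng.choices(range(vocab_size), cum_weights=cum_weights,
                            k=tokens_per_batch)
        before = len(seen)
        seen.update(batch)
        growth.append(len(seen) - before)  # new entries added by this batch
    return growth


growth = new_tokens_per_batch()
print(growth)
```

Under these assumptions the early batches add many new entries and the
later ones mostly hit tokens that are already present, so only their
counts would change.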

Tom




