[bogofilter] Filter twice: Global wordlist, then small personal wordlist

Tom Anderson tanderso at oac-design.com
Sun Apr 25 19:26:27 CEST 2004


On Fri, 2004-04-23 at 17:39, Chris Fortune wrote:
> my Question:  the personal wordlist will begin with just a few mails
> registered.  When is it safe to use it for classification?  How
> many emails must be registered before it is stable for one person's
> mail?

If your spam_cutoff is high and your ham_cutoff is low, min_dev and robs
are moderate, and robx is within your min_dev range (0.5 +/- min_dev),
then it is safe to start classifying immediately.  Even without a single
registration, this will produce an "unsure" result for every email,
which is what I would consider "safe".  Each registration thereafter
will move classifications away from robx toward their respective
ham/spam sides.  With a few registrations, you'll start seeing emails
filtered into your spam and ham boxes, with the majority still unsure.
With enough registrations (and some slow tinkering with the config
values), you should eventually have nearly all of the emails going to
either ham or spam.  So you'll never have an "unsafe" period.
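
To make the reasoning above concrete, here is a minimal Python sketch
of a Robinson-style per-token score and a three-way cutoff.  It is a
simplified model, not bogofilter's actual code: the function names are
hypothetical, the parameter values are illustrative rather than
bogofilter's defaults, and the step that combines token scores into a
message score is left out.  It also assumes that a message with no
counted tokens ends up scored at robx.

```python
def token_score(spam_count, ham_count, robx=0.5, robs=1.0):
    """Robinson-style f(w): pull the observed spam ratio toward robx.

    robs is the weight given to the prior robx; with zero
    registrations the result is exactly robx.
    """
    n = spam_count + ham_count
    p = spam_count / n if n else 0.0
    return (robs * robx + n * p) / (robs + n)


def is_counted(score, min_dev=0.1):
    """A token only contributes if it lies outside 0.5 +/- min_dev."""
    return abs(score - 0.5) >= min_dev


def classify(message_score, ham_cutoff=0.10, spam_cutoff=0.95):
    """Three-way decision: anything between the cutoffs is unsure."""
    if message_score >= spam_cutoff:
        return "spam"
    if message_score <= ham_cutoff:
        return "ham"
    return "unsure"


# Empty wordlist: every token scores robx and is ignored by min_dev,
# so (under the assumption above) the message is scored at robx.
unseen = token_score(0, 0)
print(unseen, is_counted(unseen), classify(unseen))

# Each registration moves a token further from robx.
print(token_score(1, 0), token_score(5, 0), token_score(0, 5))
```

With these illustrative values an unseen token sits at 0.5 and is not
counted, while a token registered once as spam moves to 0.75 and keeps
moving toward 1.0 with further spam registrations.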

> If the personal wordlist message-count is imbalanced, more spam than
> ham, what is the significance of this?

None that I've seen.  It may affect the "momentum" of certain values,
that is, the amount they change with each new registration, but not by
much, and not in the wrong direction.  My experience is that such an
imbalance causes no harm at all.  I have at least 2:1 spam to ham, and
my classifications are great.
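
One plausible reason an imbalance does little harm, sketched in Python
below, is that the per-token spam ratio can be computed from
per-message frequencies rather than raw counts.  This assumes the
wordlist stores the number of registered spam and ham messages
alongside the token counts; the function name and the example numbers
are hypothetical.

```python
def spam_ratio(spam_hits, ham_hits, spam_msgs, ham_msgs):
    """Spam ratio for one token, normalised by corpus sizes."""
    spam_freq = spam_hits / spam_msgs if spam_msgs else 0.0
    ham_freq = ham_hits / ham_msgs if ham_msgs else 0.0
    total = spam_freq + ham_freq
    # A token never seen in either corpus carries no evidence.
    return spam_freq / total if total else 0.5


# 2:1 spam to ham.  The token appears in 10% of each corpus, so it
# is neutral even though its raw spam count is twice its ham count.
print(spam_ratio(20, 10, 200, 100))
```

Using raw counts instead, the same token would look like 20 / 30 spam,
which is the kind of skew the normalisation avoids.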

> Later on, I can assess each personal wordlist for accuracy.  Can a
> small wordlist be merged with a larger one?

I don't see much value in this.  Since you've expressed a justified
concern that your global wordlist is polluted by hasty or mistaken
registrations, plus the inevitability of differing definitions of ham
and spam, a global wordlist is not likely to offer much filtering
integrity.  If you're going to maintain independent wordlists anyway,
why not just do away with the global list?  Trying to merge would just
use up extra processing resources without much benefit IMHO.  Of course
you're welcome to try it, and let us know how it went ;)

Tom
