[bogofilter] Filter twice: Global wordlist, then small personal wordlist

Sun Apr 25 18:21:39 CEST 2004

Chris Fortune wrote:
> In the never ending quest to provide perfect spam filtering at low
> cost.........
> 
> I now have a global wordlist that filters most spam for most users.
> Great!  Thanks, guys.
> 
> I would like users to be able to train it, but some users have proven
> themselves to be horribly unreliable in their judgements.  So, rather
> than create some sort of "user reliability quotient (R.Q.)" and put
> out their brush fires,  I would like them to each have their own
> wordlists.  That way if they screw up their re-classification, they
> only bodge their own mail.  I also hope that these personal wordlists
> will be small and light, and hopefully dead accurate.
> 
> my Question:  the personal wordlist will begin with just a few mails
> registered.  When is it safe to use it for classification?  How many
> emails must be registered before it is stable for one person's mail?
> 
> If the personal wordlist message-count is imbalanced, more spam than
> ham, what is the significance of this?
> 
> Later on, I can assess each personal wordlist for accuracy.  Can a
> small wordlist be merged with a larger one?
> 
> Idealistic question:  is it possible to temporarily merge two
> wordlists in memory, like this imaginary command:
> 
> bogofilter -d/path/to/global.wordlist.db -d/path/to/user.wordlist.db
> --merge  < test.eml
> 
> 

How about running bogofilter twice?

Run it once, read-only, against the main list and put everything it marks 
Unsure into a separate category.  You would want a very wide Unsure range on 
this common list, so that anything it does call Spam or Ham is 99% certain.

Something almost like:
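# First pass: global list, read-only (no -u).  -d takes the directory
# that holds wordlist.db, not the file itself.
:0fw
| bogofilter -pe -d /etc/bogofilter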

:0
* ^X-Bogosity: Unsure
{
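	# Set the first verdict aside (renamed to Old-X-Bogosity).
	:0fw
	| formail -i "X-Bogosity:"

	# Second pass: no -d, so the user's own wordlist is used;
	# -u also registers the message under the verdict it gets.
	:0fw
	| bogofilter -peu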

}

# Use whatever mailbox paths suit you.
:0
* ^X-Bogosity: Spam
/quarantine

:0
* ^X-Bogosity: Unsure
/somewhere_else
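
To spell out the decision the recipe makes, here is a rough sketch in Python.
It is not bogofilter code: the cutoff values, the function names and the
user_score_fn callable are all made up for illustration, and the exact
comparison bogofilter applies at the cutoffs may differ.  The point is only
that the global pass uses a wide Unsure band and the user's list is consulted
only for what falls inside it.

```python
# Sketch of the two-pass decision.  Cutoffs and names are assumptions
# for illustration, not bogofilter defaults.

def verdict(score, ham_cutoff, spam_cutoff):
    """Map a spamicity score in [0, 1] to Ham, Unsure or Spam."""
    if score >= spam_cutoff:
        return "Spam"
    if score <= ham_cutoff:
        return "Ham"
    return "Unsure"


def two_pass(global_score, user_score_fn,
             global_cutoffs=(0.01, 0.99), user_cutoffs=(0.45, 0.55)):
    """Classify with the global list first, then the user's list.

    user_score_fn is called only when the global verdict is Unsure,
    mirroring the nested block in the procmail recipe.
    """
    first = verdict(global_score, *global_cutoffs)
    if first != "Unsure":
        return first
    return verdict(user_score_fn(), *user_cutoffs)
```

With these made-up numbers, a global score of 0.995 is Spam straight away,
while a global score of 0.5 is handed to the user's list for the final call.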

_________________________________
If you are using IMAP, you could additionally set up scripts that retrain the
user's own wordlist on any mail the user flags as misclassified, and that also
copy those messages back to you so you can process them through 
bogofilter-common if you want to.
This might make a good argument for training the common list only on the 
Unsure messages, to keep the workload to a minimum.
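
A correction script along those lines could be quite small.  The sketch below
only builds the command line; the function name, the was_registered switch and
the example directory are hypothetical.  It assumes the user's list was
trained with -u, so a misfiled message has to be unregistered from the wrong
side (-N or -S) as well as registered on the right one (-s or -n).

```python
# Sketch only: builds a bogofilter command to correct one message in a
# user's wordlist.  Names and paths here are illustrative assumptions.

def correction_command(true_class, user_dir, was_registered=True):
    """Return the argv list for retraining on a misclassified message.

    true_class     -- "spam" or "ham", as judged by the user
    user_dir       -- directory holding the user's wordlist
    was_registered -- True if the wrong verdict was auto-registered (-u)
    """
    if true_class == "spam":
        flags = "-Ns" if was_registered else "-s"
    elif true_class == "ham":
        flags = "-Sn" if was_registered else "-n"
    else:
        raise ValueError("true_class must be 'spam' or 'ham'")
    return ["bogofilter", flags, "-d", user_dir]
```

The message itself would then be fed to that command on standard input, for
example with subprocess.run(cmd, stdin=open(path, "rb")).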

However, "light" and "dead accurate" tend to contradict each other in 
statistics in general: a small wordlist has fewer messages behind each token 
count, so its estimates are noisier.  But you might be able to get somewhere 
with it easily enough.  Recall that you are now only testing the user's 
wordlist against what was initially measured as Unsure.