[bogofilter] Filter twice: Global wordlist, then small personal wordlist

David Relson relson at osagesoftware.com
Sat Apr 24 00:55:12 CEST 2004


On Fri, 23 Apr 2004 14:39:53 -0700
Chris Fortune wrote:

> In the never ending quest to provide perfect spam filtering at low
> cost.........
> 
> I now have a global wordlist that filters most spam for most users. 
> Great!  Thanks, guys.
> 
> I would like users to be able to train it, but some users have proven
> themselves to be horribly unreliable in their judgements.  So, rather
> than create some sort of "user reliability quotient (R.Q.)" and put
> out their brush fires,  I would like them to each have their own
> wordlists. That way if they screw up their re-classification, they
> only bodge their own mail.  I also hope that these personal wordlists
> will be small and light, and hopefully dead accurate.

Hi Chris,

We all hope for the magical silver bullet.  After all, the Lone Ranger
had silver bullets decades ago, so why can't we?

That being said, there is no magical answer to your questions :-<

> my Question:  the personal wordlist will begin with just a few mails
> registered.  When is it safe to use it for classification?  How many
> emails must be registered before it is stable for one person's mail?

No hard and fast rules here.  A lot depends on whether the line between
ham and spam is clear and clean.  For the lucky few, it is.  For the
rest of us, there is no clear line.  Off-hand I'd say bogofilter starts to
give usable results with very little training.  As few as 50 or 100
messages each of ham and spam is enough to get it started.  As always,
human supervision is a good idea -- particularly in checking for false
positives -- because no one wants to discard an important message.

> If the personal wordlist message-count is imbalanced, more spam than 
> ham, what is the significance of this?

Again, there's no hard and fast rule.  Try to keep the ham:spam message
ratio between 1:2 and 2:1, so that neither class outnumbers the other by
more than a factor of two.
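
A minimal sketch of that rule of thumb, in Python.  The function name and
the max_ratio parameter are made up for illustration and are not part of
bogofilter; the message counts would have to come from wherever you track
how many ham and spam messages each user has registered.

```python
def ratio_within_bounds(ham_count, spam_count, max_ratio=2.0):
    """Return True if neither class outnumbers the other by more than max_ratio.

    Hypothetical helper, not a bogofilter API. An empty class is treated
    as out of bounds, since no ratio can be computed for it.
    """
    if ham_count <= 0 or spam_count <= 0:
        return False
    larger = max(ham_count, spam_count)
    smaller = min(ham_count, spam_count)
    return larger / smaller <= max_ratio
```

With this check, 100 ham and 200 spam messages are within bounds, while 50
ham and 200 spam are not.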

> Later on, I can assess each personal wordlist for accuracy.  Can a 
> small wordlist be merged with a larger one?
> 
> Idealistic question:  is it possible to temporarily merge two 
> wordlists in memory, like this imaginary command:
> 
>     bogofilter -d/path/to/global.wordlist.db 
>                -d/path/to/user.wordlist.db --merge  < test.eml

You are, I think, asking how best to simultaneously use a pair of
wordlists.  Two major policies come to mind:

1 - add the counts from the global & user lists and compute the
"combined" score.

2 - give preference to one list.  Since the user's list is tailored to
his/her personal definitions of ham and spam, if bogofilter finds a word
in the user's list it should use that result and ignore the global list.

Bogofilter used to have support for multiple wordlists.  An "importance"
value could be assigned to each one.  When searching for a word, the
wordlists would be searched in order of "importance".  Once found,
bogofilter would check for additional wordlists with the same importance
and combine the scores.  Thus policy #1 could be implemented by
assigning the same "importance" value to both lists and policy #2 could
be implemented by assigning greater "importance" to the user list than
the global list.  However, since nobody was using the feature and it
hadn't been thoroughly tested, it was removed as part of the bogofilter
cleanup several months ago.
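
To make the "importance" lookup concrete, here is a rough Python sketch of
the idea as described above.  It is not the removed bogofilter code: the
Wordlist class, the Counts tuple, the lookup function, and the convention
that a larger number means greater importance are all assumptions made for
illustration, and the wordlists are plain in-memory dictionaries.

```python
from collections import namedtuple

# Per-token message counts (hypothetical layout, not bogofilter's format).
Counts = namedtuple("Counts", ["spam", "ham"])


class Wordlist:
    """An in-memory stand-in for a wordlist with an importance value."""

    def __init__(self, name, importance, words):
        self.name = name
        self.importance = importance  # assumption: larger = searched first
        self.words = words            # dict mapping token -> Counts


def lookup(token, wordlists):
    """Search lists in order of importance.

    Once the token is found, counts are summed across the lists that share
    that importance, and less important lists are ignored.
    Returns None if no list contains the token.
    """
    ordered = sorted(wordlists, key=lambda w: w.importance, reverse=True)
    hit_level = None
    spam = ham = 0
    for wl in ordered:
        if hit_level is not None and wl.importance != hit_level:
            break  # token already found at a more important level
        counts = wl.words.get(token)
        if counts is None:
            continue
        hit_level = wl.importance
        spam += counts.spam
        ham += counts.ham
    if hit_level is None:
        return None
    return Counts(spam, ham)
```

Giving the global and user lists the same importance yields policy #1
(counts are added); giving the user list a larger importance yields
policy #2 (the user's entry wins, and the global list is consulted only
for words the user list lacks).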

I hope I haven't bored you to death with all this information/history
:->

David



-- 
David Relson                   Osage Software Systems, Inc.
relson at osagesoftware.com       Ann Arbor, MI 48103
www.osagesoftware.com          tel:  734.821.8800


