multiple wordlists

Greg Louis glouis at dynamicro.on.ca
Mon Mar 17 14:09:19 CET 2003


On 20030317 (Mon) at 0754:20 -0500, David Relson wrote:

> Observation 3 - a report from a group of sysadmins last Thursday.
> 
> Mike maintains a site with 4000 or so email users.  An attempt to create a 
> site wide spam filter with one rule set for all users failed for the 
> obvious reason - conflicting definitions of spam/ham.  Dividing the group 
> into 26 segments (based on first letter of user id) was successful.  Given 
> the arbitrary nature of the grouping, this seems a bit surprising.  I'd 
> have thought it necessary to use a division by category (marketing, 
> engineering, ...).  The moral is (probably) you can't tell what will work 
> for a group until you experiment to learn about _your_ user community.

Actually, the success of the arbitrary division is perhaps not _that_
surprising.  One would expect the impact of population diversity to be
related in some non-linear way -- perhaps even exponentially -- to the
size of the group.  I'd expect classification by occupation to work
better, provided the users didn't get large volumes of non-work-related
email (some of mine do), but in general, lists for smaller groups --
provided the training db doesn't get too tiny, which was Elijah's point
-- ought to do better than big ones.  (Mike's q and x groups ought to
be particularly successful ;)
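
For anyone wanting to try the first-letter split, here is a minimal sketch
of how a user id could be mapped to one of the 26 groups.  It is not Mike's
actual setup, which the report didn't describe: the base directory, the
function names and the "other" bucket for ids that don't start with a
letter are all assumptions made for illustration.

```python
import string
from collections import Counter
from pathlib import PurePosixPath

# Hypothetical base directory for per-group wordlists (an assumption,
# not something taken from the report).
WORDLIST_ROOT = PurePosixPath("/var/spool/bogofilter")


def group_for(user_id: str) -> str:
    """Return the one-letter group for a user id.

    Ids that are empty or don't start with an ASCII letter go into an
    assumed catch-all group called "other".
    """
    first = user_id.strip()[:1].lower()
    return first if first and first in string.ascii_lowercase else "other"


def wordlist_dir(user_id: str) -> PurePosixPath:
    """Return the (hypothetical) wordlist directory shared by the user's group."""
    return WORDLIST_ROOT / group_for(user_id)


def group_sizes(user_ids):
    """Count users per group, to spot groups whose training db may be too small."""
    return Counter(group_for(u) for u in user_ids)
```

Running group_sizes() over the real user list before committing to the
split would show which groups (q and x, most likely) end up with very few
members and so very little training data.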

-- 
| G r e g  L o u i s          | gpg public key: finger     |
|   http://www.bgl.nu/~glouis |   glouis at consultronics.com |
| http://wecanstopspam.org in signatures fights junk email |



