multiple wordlists

David Relson relson at osagesoftware.com
Mon Mar 17 14:19:10 CET 2003


At 08:09 AM 3/17/03, Greg Louis wrote:

>On 20030317 (Mon) at 0754:20 -0500, David Relson wrote:
>
> > Observation 3 - a report from a group of sysadmins last Thursday.
> >
> > Mike maintains a site with 4000 or so email users.  An attempt to create a
> > site wide spam filter with one rule set for all users failed for the
> > obvious reason - conflicting definitions of spam/ham.  Dividing the group
> > into 26 segments (based on first letter of user id) was successful.  Given
> > the arbitrary nature of the grouping, this seems a bit surprising.  I'd
> > have thought it necessary to use a division by category (marketing,
> > engineering, ...).  The moral is (probably) you can't tell what will work
> > for a group until you experiment to learn about _your_ user community.
>
>Actually, the success of the arbitrary division is perhaps not _that_
>surprising.  One would expect the impact of population diversity to be
>related in some non-linear way -- perhaps even exponentially -- to the
>size of the group.  I'd expect classification by occupation to work
>better, provided the users didn't get large volumes of non-work-related
>email (some of mine do), but in general, lists for smaller groups --
>provided the training db doesn't get too tiny, which was Elijah's point
>-- ought to do better than big ones.  (Mike's q and x groups ought to
>be particularly successful ;)

Followed closely by y and z.  s probably does poorly.
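
For anyone who wants to check how lopsided a first-letter split would be on their own site, here is a minimal sketch in Python. It is only an illustration, not how Mike's site does it: the function names, the "other" catch-all for ids that don't start with a letter, and the sample ids are all made up.

```python
from collections import Counter


def segment_for(user_id: str) -> str:
    """Return the segment for a user id: its first letter, lowercased."""
    first = user_id[:1].lower()
    # Assumption: ids that are empty or don't start with a-z share one
    # catch-all segment instead of getting their own wordlist.
    if len(first) == 1 and "a" <= first <= "z":
        return first
    return "other"


def segment_sizes(user_ids):
    """Count how many users land in each segment."""
    return Counter(segment_for(u) for u in user_ids)


if __name__ == "__main__":
    # Hypothetical ids, only to show the shape of the result.
    users = ["smith", "sara", "sam", "quinn", "xavier", "9lives"]
    for segment, count in sorted(segment_sizes(users).items()):
        print(segment, count)
```

Run against a real user list, the counts show which segments are big enough to train a usable wordlist and which (q, x, ...) might be too small.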




