multiple wordlists

Mon Mar 17 18:46:24 CET 2003

At 12:25 PM 3/17/03, elijah wrote:

>On Mon, 17 Mar 2003, Greg Louis wrote:
>
> > We're catching around 95% of spam for that lot with bogofilter, though
> > until quite recently it was more like 85%.  But for those of us with
> > enough email to run personal lists, we can get 99% or so.
>
>On Mon, 17 Mar 2003, David Relson wrote:
>
> > Mike maintains a site with 4000 or so email users.  An attempt to create
> > a site-wide spam filter with one rule set for all users failed for the
> > obvious reason - conflicting definitions of spam/ham.  Dividing the
> > group into 26 segments (based on first letter of user id) was
> > successful.  Given the arbitrary nature of the grouping, this seems a
> > bit surprising.
>
>I guess the lesson here is that you can predict all you want, but what
>matters is the empirical data. For this reason, in the future I will
>remain mum when it comes to wild speculation and base any claims on real
>data.

No need to keep mum.  Speculations can be a good start for discussions and 
for generating an understanding of how/why things work and what needs to be 
done.

>Are there production sites with a significant user base keeping accurate
>statistics? If we could collate a bunch of real data, general rules might
>emerge, like this bit about division by letter working better. I am
>surprised by it, only because generally you have about ten times as many
>Ms as Qs--so some of the group wordlists look really different in size
>than others.

All we really "know" is that one large group wasn't useful but 26 smaller 
groups were.  We don't even know whether some groups worked better than 
others, though I'd bet that some did.  I'd expect smaller groups to work 
better than larger ones, e.g. Q and X vs. M and S, but I've no information 
on that.
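
For anyone who wants to check how uneven the letter groups are at their own 
site, here is a minimal sketch of the first-letter grouping described above.  
It is only an illustration, not how Mike's site does it and not part of 
bogofilter: the function names, the sample user ids, and the catch-all "_" 
group for ids that don't start with a letter are all assumptions.

```python
from collections import Counter


def wordlist_group(user_id: str) -> str:
    """Map a user id to a group key: its lowercased first letter.

    Ids that are empty or do not start with a-z go to a catch-all
    group "_" (an assumption made for this sketch).
    """
    first = user_id[:1].lower()
    if len(first) == 1 and "a" <= first <= "z":
        return first
    return "_"


def group_sizes(user_ids):
    """Count how many users land in each first-letter group."""
    return Counter(wordlist_group(u) for u in user_ids)


# Hypothetical sample ids; a real check would read the site's user list.
users = ["mike", "mary", "quentin", "Sam", "sue", "xavier", "42bob"]
sizes = group_sizes(users)

# Print the groups from largest to smallest.
for group, count in sizes.most_common():
    print(group, count)
```

Run against a real user list, the counts would show how much larger the M 
and S groups are than Q and X, which is the first thing to look at before 
comparing filter accuracy per group.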