multiple wordlists

elijah elijah at riseup.net
Mon Mar 17 18:25:15 CET 2003


On Mon, 17 Mar 2003, Greg Louis wrote:

> We're catching around 95% of spam for that lot with bogofilter, though
> until quite recently it was more like 85%.  But for those of us with
> enough email to run personal lists, we can get 99% or so.

On Mon, 17 Mar 2003, David Relson wrote:

> Mike maintains a site with 4000 or so email users.  An attempt to create
> a site wide spam filter with one rule set for all users failed for the
> obvious reason - conflicting definitions of spam/ham.  Dividing the
> group into 26 segments (based on first letter of user id) was
> successful.  Given the arbitrary nature of the grouping, this seems a
> bit surprising.

I guess the lesson here is that you can predict all you want, but what
matters is the empirical data. For this reason, in the future I will
remain mum when it comes to wild speculation and base any claims on real
data.

Are there production sites with a significant user base keeping accurate
statistics? If we could collate a bunch of real data, general rules might
emerge, like this bit about division by letter working better than a
single site-wide list. I am surprised by it, only because you generally
have about ten times as many Ms as Qs--so some of the group wordlists
must differ greatly in size from others.
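
To make the size concern concrete, here is a minimal sketch of the
first-letter grouping, assuming nothing about how Mike's site actually
does it. The function names, the sample user ids, and the "other"
catch-all bucket are all hypothetical; it only counts how many users
land in each segment and does not touch bogofilter itself.

```python
from collections import Counter


def segment_for_user(user_id: str) -> str:
    """Return the wordlist segment for a user id, keyed on its first letter."""
    first = user_id.strip().lower()[:1]
    # Assumption: ids that do not start with a-z share one catch-all segment.
    return first if "a" <= first <= "z" else "other"


def segment_sizes(user_ids):
    """Count how many users fall into each segment."""
    return Counter(segment_for_user(u) for u in user_ids)


# Made-up user ids, purely for illustration.
users = ["mike", "mary", "martin", "quentin", "greg", "david", "9lives"]
sizes = segment_sizes(users)

# Print segments from largest to smallest to show the imbalance.
for segment, count in sizes.most_common():
    print(segment, count)
```

Run against a real user list, the same count would show how lopsided
the 26 groups are before anyone compares their filtering accuracy.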

Now I will violate my just-made resolution to keep to the facts: I suspect
that another big problem with shared wordlists is garbage in, garbage out.
Your average user is probably not going to be very diligent about
correcting misidentified mail.

-elijah




