multiple wordlists

Mon Mar 17 13:17:32 CET 2003

On 20030316 (Sun) at 2101:04 -0800, elijah wrote:
> On Sat, 15 Mar 2003, David Relson wrote:
> 
> > So, given all this information, the question is "How should bogofilter deal
> > with multiple wordlists?"
> 
> I like #2 (the average weight of two separate tokens).
> 
> Without any data, I have no good reason for my preference. My blind hunch
> would be that this one is best for my situation because:
> 
> - most user wordlists will be sparse
> - the site wordlist will be pretty darn good

Voluminous != good, even though sparse == bad, because diverse == bad too.

> Why would the site wordlist be any good? It would not be if it were an
> immutable 'corporate' standard. But if every user's mail is used as input
> into the site wordlist in addition to their user wordlist, then the
> site wordlist is a pretty good evolving democracy of tokens. This is
> especially true in cases like mine, where we provide email to thousands of
> radical activists--a fairly specific population with pretty similar email.

Not especially -- only.

> Even in less extreme cases, you could imagine a pretty good site wordlist.
> Any university or company or organization or even regional ISP will have
> some basis for a shared commonality among the users which sets them apart
> from your 'average' internet user.

Experience shows that doesn't usually suffice.  Any organization of any
size has very disparate subpopulations.  Take a company like the one I
work for: engineers, marketing folks, sales people, purchasers,
production planners, IT people, test technicians -- and then, laid on
top of that, an even wider spread of personal interests and contacts. 
We're catching around 95% of spam for that lot with bogofilter, though
until quite recently it was more like 85%.  But for those of us with
enough email to run personal lists, we can get 99% or so.  You could
say that's 5 times better -- I get one spam a day, my colleague next
door gets 5, out of the hundred or so that arrive for each of us.  (An
interesting statistic: 41% of all the mail our division got in the
first week of March was junk!)
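
To spell out the "5 times better" arithmetic, here is a small sketch in
Python.  The function name is made up for illustration, and the figures are
the round numbers quoted above (about 100 spam a day, 95% vs. 99% caught),
not measurements taken from any log.

```python
def missed_per_day(spam_per_day, caught_percent):
    """Spam that still reaches the inbox, given a catch rate in percent."""
    return spam_per_day * (100 - caught_percent) / 100

# Round figures from the paragraph above (illustrative only).
site_list_misses = missed_per_day(100, 95)      # shared site wordlist
personal_list_misses = missed_per_day(100, 99)  # personal wordlist

print(site_list_misses, personal_list_misses)
print(site_list_misses / personal_list_misses)
```

The point is that the comparison that matters is the miss rate (5% vs. 1%),
not the catch rate (95% vs. 99%).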

> For these reasons, I suspect #2 will work best for a site with a fair
> number of users, with similar interests, and not much mail.

With this I'd agree.  However, my point is that such sites are, I
believe, less common than you suggest.

> Even better, #4 seems like it would be ideal, but with opposite weighting
> to what David suggested. Instead of privileging the user wordlist, I would
> privilege the site wordlist at first and then slowly reduce W to 1 as the
> user wordlist gets enough data to be meaningful (resulting in method #2).
> I guess this is pretty much what David suggested, just with different values.

The nice thing about #4 is that you can tune it to be right for your
user group and I can tune it to be right for mine.  (The bad thing about
it is that almost everyone will use the default weights that happen to
come with the package, just as they use the default s, x and min_dev,
and will thereby fail to get the best performance for their
environments.)
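
For concreteness, here is a rough sketch in Python of what a tunable #4
might look like.  It is not bogofilter code.  It assumes that #4 is a
weighted average of a token's per-list spam probabilities, that equal
weights give the plain average of #2, and that the weight falls linearly
to 1 as the user list grows.  The function names, the starting weight of 4
and the 1000-message ramp are all invented for illustration.

```python
def combined_prob(user_p, site_p, w_user=1.0, w_site=1.0):
    """Weighted average of a token's spam probability from two wordlists.

    With w_user == w_site this is the plain average (assumed to be #2).
    """
    return (w_user * user_p + w_site * site_p) / (w_user + w_site)


def decaying_site_weight(user_msgs, w_start=4.0, ramp=1000):
    """Hypothetical schedule: the site weight falls linearly from w_start
    to 1 as the user's own list accumulates `ramp` trained messages."""
    frac = min(user_msgs / ramp, 1.0)
    return w_start - (w_start - 1.0) * frac


# A token the user list scores as spammy but the site list scores as clean.
for n in (0, 500, 1000):
    w = decaying_site_weight(n)
    print(n, w, combined_prob(0.9, 0.1, w_site=w))
```

Swapping which list gets the larger weight gives the other variant
discussed above; either way the weights are the knob each site has to tune
for its own population.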

-- 
| G r e g  L o u i s          | gpg public key: finger     |
|   http://www.bgl.nu/~glouis |   glouis at consultronics.com |
| http://wecanstopspam.org in signatures fights junk email |