multiple wordlists

elijah elijah at riseup.net
Mon Mar 17 06:01:04 CET 2003


On Sat, 15 Mar 2003, David Relson wrote:

> So, given all this information, the question is "How should bogofilter deal
> with multiple wordlists?"

I like #2 (the average weight of two separate tokens).

Without any data, I have no good reason for my preference. My blind hunch
would be that this one is best for my situation because:

- most user wordlists will be sparse
- the site wordlist will be pretty darn good

Why would user wordlists be sparse? One drawback I see for Bayesian
filtering is that most people simply don't get that much email. Your
typical user doesn't have a corpus on hand and doesn't get enough email in
a whole year to build up a really good dataset. Geeks might get enough in
a month, but your typical user is freaking out if they receive 1/10 the
mail of a techie.

Why would the site wordlist be any good? It would not be if it were an
immutable 'corporate' standard. But if every user's mail is used as input
into the site wordlist in addition to their user wordlist, then the
site wordlist is a pretty good evolving democracy of tokens. This is
especially true in cases like mine, where we provide email to thousands of
radical activists--a fairly specific population with pretty similar email.
Even in less extreme cases, you could imagine a pretty good site wordlist.
Any university or company or organization or even regional ISP will have
some basis for a shared commonality among the users which sets them apart
from your 'average' internet user.
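
To make the "every user's mail feeds both lists" idea concrete, here is a minimal sketch. The Wordlist class and the train() helper are hypothetical names made up for illustration; they are not bogofilter's actual storage format or API.

```python
from collections import Counter


class Wordlist:
    """Hypothetical in-memory token store (not bogofilter's real format)."""

    def __init__(self):
        self.spam = Counter()   # token -> number of spam messages containing it
        self.ham = Counter()    # token -> number of non-spam messages containing it
        self.spam_msgs = 0
        self.ham_msgs = 0

    def register(self, tokens, is_spam):
        # Count each distinct token once per message.
        counts = self.spam if is_spam else self.ham
        counts.update(set(tokens))
        if is_spam:
            self.spam_msgs += 1
        else:
            self.ham_msgs += 1


def train(user_list, site_list, tokens, is_spam):
    """Register one message in the user's own list and in the shared site list."""
    user_list.register(tokens, is_spam)
    site_list.register(tokens, is_spam)


site = Wordlist()
alice, bob = Wordlist(), Wordlist()
train(alice, site, ["cheap", "pills", "cheap"], is_spam=True)
train(bob, site, ["cheap", "watches"], is_spam=True)
train(bob, site, ["meeting", "tonight"], is_spam=False)
```

Each user list stays sparse (alice has seen one message), while the site list accumulates every user's training.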

For these reasons, I suspect #2 will work best for a site with a fair
number of users, with similar interests, and not much mail.

Even better, #4 seems like it would be ideal, but with opposite weighting
to what David suggested. Instead of privileging the user wordlist, I would
privilege the site wordlist at first and then slowly reduce W to 1 as the
user wordlist gets enough data to be meaningful (resulting in method #2).
I guess this is pretty much what David suggested, just with different values.
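
A rough sketch of that weighting, assuming each wordlist yields a per-token spam probability and that W multiplies the site value in a weighted average. The starting weight (w_max), the "enough data" threshold (enough), the linear decay, and the function names are all my own guesses for illustration, not values or formulas from David's proposal.

```python
def site_weight(user_msgs, w_max=10.0, enough=500):
    """Weight W given to the site wordlist.

    Starts at w_max when the user has trained on nothing and decays
    linearly to 1 once the user has trained on `enough` messages.
    Both defaults are arbitrary illustrative values.
    """
    frac = min(user_msgs / enough, 1.0)
    return w_max - (w_max - 1.0) * frac


def combined_score(user_p, site_p, user_msgs, w_max=10.0, enough=500):
    """Weighted average of the two per-token spam probabilities.

    With W == 1 this is the plain average of the two values (method #2).
    """
    w = site_weight(user_msgs, w_max, enough)
    return (w * site_p + user_p) / (w + 1.0)


# A new user: the site value dominates.
new_user = combined_score(user_p=0.5, site_p=0.9, user_msgs=0)
# A well-trained user: W has dropped to 1, so this is the plain average.
trained_user = combined_score(user_p=0.5, site_p=0.9, user_msgs=500)
```

The decay could just as well be keyed to the token's own count in the user wordlist rather than to the user's total message count; I have no data to say which is better.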

Anyway, my two cents.

-elijah




