multiple wordlists
David Relson
relson at osagesoftware.com
Mon Mar 17 13:54:20 CET 2003
At 07:17 AM 3/17/03, Greg Louis wrote:
> > Even better, #4 seems like it would be ideal, but with opposite weighting
> > to what David suggested. Instead of privileging the user wordlist, I would
> > privilege the site wordlist at first and then slowly reduce W to 1 as the
> > user wordlist gets enough data to be meaningful (resulting in method #2).
> > I guess this is pretty much what David suggested, just with different
> > values.
>
>The nice thing about #4 is that you can tune it to be right for your
>user group and I can tune it to be right for mine. (The bad thing about
>it is that almost everyone will use the default weights that happen to
>come with the package, just as they use the default s, x and min_dev,
>and will thereby fail to get the best performance for their
>environments.)
As a reminder, below are one-line summaries of the 4 combination methods
suggested so far. I've also given them single-word names, which are
hopefully meaningful.
1.total - sum the token counts and message counts across all lists before
calculating any token probabilities.
2.combine - each (token,list) combination counts as one token in the final
score.
3.precedence - after finding a token in a wordlist pair, other pairs aren't
checked.
4.weighting - a weighting factor is applied for each list.
Observation 1: #4 (weighting) calls for weight factors to be included in
the config files.
Observation 2: #1, #2, and #3 can be expressed in terms of #4:
2.combine is the same as 4.weighting with equal weights for all lists.
3.precedence is comparable to 4.weighting with a large difference in
weighting, say 100:1.
1.total is comparable to 4.weighting with weights based on relative sizes
(message counts).
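A minimal sketch of Observation 2, under assumptions that are not settled anywhere in this thread: it treats 4.weighting as a weighted average of the per-list token probabilities, and it skips lists that have no data for the token. The Wordlist class, weighted_prob(), and all the counts are hypothetical illustrations, not bogofilter code.

```python
from dataclasses import dataclass, field


@dataclass
class Wordlist:
    """Hypothetical stand-in for one wordlist; not bogofilter's real structure."""
    spam_msgs: int
    ham_msgs: int
    counts: dict = field(default_factory=dict)  # token -> (spam_count, ham_count)

    def prob(self, token):
        """Spam probability from this list alone, or None if it has no data."""
        spam, ham = self.counts.get(token, (0, 0))
        if spam + ham == 0 or self.spam_msgs == 0 or self.ham_msgs == 0:
            return None
        spam_rate = spam / self.spam_msgs
        ham_rate = ham / self.ham_msgs
        return spam_rate / (spam_rate + ham_rate)


def weighted_prob(token, lists, weights):
    """Weighted average of the per-list probabilities (assumed form of #4)."""
    num = den = 0.0
    for wordlist, weight in zip(lists, weights):
        p = wordlist.prob(token)
        if p is not None:
            num += weight * p
            den += weight
    return num / den if den else None


# Made-up counts, for illustration only.
site = Wordlist(1000, 1000, {"offer": (200, 50)})  # large list, token looks spammy
user = Wordlist(10, 10, {"offer": (1, 4)})         # small list, token looks hammy
lists = [site, user]

combine = weighted_prob("offer", lists, [1, 1])        # equal weights
precedence = weighted_prob("offer", lists, [1, 100])   # user list dominates
sizes = [wl.spam_msgs + wl.ham_msgs for wl in lists]
total = weighted_prob("offer", lists, sizes)           # weights follow list size
print(combine, precedence, total)
```

With equal weights both lists pull equally; with 100:1 in favor of the user list the result sits close to the user list's own estimate; with size-based weights it sits close to what pooling the raw counts would give. The match with 1.total and 3.precedence is only approximate, which is why they are "comparable" rather than "the same".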
Observation 3: a report from a group of sysadmins last Thursday.
Mike maintains a site with 4000 or so email users. An attempt to create a
site-wide spam filter with one rule set for all users failed for the
obvious reason - conflicting definitions of spam/ham. Dividing the group
into 26 segments (based on first letter of user id) was successful. Given
the arbitrary nature of the grouping, this seems a bit surprising. I'd
have thought it necessary to use a division by category (marketing,
engineering, ...). The moral is (probably) you can't tell what will work
for a group until you experiment to learn about _your_ user community.
Observation 4:
Weighting is probably not static. The proper weights to use will change
over time as list sizes change.
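One way a non-static weight could look, following the idea quoted at the top of this message (privilege the site wordlist at first, then reduce W to 1 as the user wordlist fills up). This is only a sketch: the function name site_weight(), the linear schedule, and the default values are all assumptions chosen for illustration, not anything proposed in the thread.

```python
def site_weight(user_msgs, initial_weight=10.0, enough_msgs=1000):
    """Weight W for the site wordlist, relative to a user-list weight of 1.

    Starts at initial_weight and falls linearly to 1 once the user wordlist
    holds enough_msgs messages. All defaults are illustrative guesses.
    """
    if enough_msgs <= 0:
        return 1.0
    # Fraction of the way to "enough data", clamped to the range [0, 1].
    progress = min(max(user_msgs / enough_msgs, 0.0), 1.0)
    return initial_weight - (initial_weight - 1.0) * progress


for n in (0, 250, 500, 1000, 5000):
    print(n, site_weight(n))
```

Once W reaches 1 the two lists carry equal weight, which is 2.combine. Whether a linear decay is the right shape, and what counts as "enough" messages, would have to come from experiments on real user data.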