best practices question

Fri Sep 20 22:12:28 CEST 2002

On Fri, Sep 20, 2002 at 02:01:54PM -0400, David Relson wrote:
>
> 1 - Create good and spam word lists (using the '-h' and '-s' options).  Let 
> bogofilter classify messages.  For incorrectly classified messages, feed 
> them into the word lists (again using the '-h' and '-s' options).
> 
> 2 - Create word lists (as above).  When a message is classified as spam, 
> automatically merge it into the word list (using '-s').  This will expand 
> the spam list by including words that have "appeared in a spam 
> context".  For incorrectly classified messages, use the '-H' and '-S' 
> options so that probabilities will shift from the wrong answer to the right 
> answer.
> 
> What do y'all think is the best practice for handling word list updating?

If I understand you correctly, option 2, by far.
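
To make option 2 concrete, here is a rough Python sketch of the update
loop as I read it. It is not bogofilter's code or its on-disk format:
the WordLists class, the whitespace tokenizer, and the method names are
all made up for illustration. register() plays the role the quoted
message gives to '-s' and '-h', and reclassify() the role it gives to
'-S' and '-H' (shifting counts from the wrong list to the right one).

```python
from collections import Counter


class WordLists:
    """Toy stand-in for a pair of good/spam word lists (hypothetical)."""

    def __init__(self):
        self.good = Counter()
        self.spam = Counter()

    @staticmethod
    def tokens(message):
        # Hypothetical tokenizer: lowercase and split on whitespace.
        return message.lower().split()

    def register(self, message, is_spam):
        # Add the message's words to one list (the '-s' / '-h' role).
        target = self.spam if is_spam else self.good
        target.update(self.tokens(message))

    def reclassify(self, message, is_spam):
        # Move the message's words from the wrong list to the right one
        # (the '-S' / '-H' role as described in the quoted message).
        wrong, right = (self.good, self.spam) if is_spam else (self.spam, self.good)
        words = Counter(self.tokens(message))
        wrong.subtract(words)
        for word in words:
            # Drop entries that fell to zero or below.
            if wrong[word] <= 0:
                del wrong[word]
        right.update(words)


lists = WordLists()
# The filter called this spam, so it was merged automatically.
lists.register("cheap pills cheap", is_spam=True)
# The user says it was actually good mail: shift the counts over.
lists.reclassify("cheap pills cheap", is_spam=False)
```

After the correction the words are counted only in the good list, which
is the "shift from the wrong answer to the right answer" effect.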

For me, a big part of the utility of a Bayesian spam filter is that I
don't have to do the work of figuring out what makes spam identifiable
as spam.  All I have to do is identify it, and let the software find
the interesting words.

Looking at the counts kept by ifile, I see that "border" has appeared
26 times in nonspam and 1874 times in spam.  It would normally not
occur to me to filter on this word.
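
As a back-of-the-envelope illustration of why those counts matter, the
Python sketch below turns them into a single ratio. This is not
necessarily what ifile or bogofilter computes; the function name and the
good_msgs/spam_msgs parameters are assumptions of mine, and since the
sizes of the two corpora are not given here, the example treats them as
equal.

```python
def spam_ratio(good_count, spam_count, good_msgs=1, spam_msgs=1):
    """Share of a word's per-message occurrences that came from spam.

    good_msgs and spam_msgs are hypothetical corpus sizes; leaving both
    at 1 treats the two corpora as the same size.
    """
    good_rate = good_count / good_msgs
    spam_rate = spam_count / spam_msgs
    if good_rate + spam_rate == 0:
        return 0.5  # word never seen: no evidence either way
    return spam_rate / (good_rate + spam_rate)


# "border": 26 appearances in nonspam, 1874 in spam.
print(round(spam_ratio(26, 1874), 3))  # 0.986
```

A word that lands that close to 1 is a strong spam indicator, even
though nobody would pick it by hand.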

I also train the filter on good mail, and I don't see why anyone
wouldn't, if they're using a private set of word lists.

-- 
Ben Rosengart     (212) 741-4400 x215

Microsoft has argued that open source is bad for business, but you
have to ask, "Whose business?  Theirs, or yours?"    --Tim O'Reilly

For summary digest subscription: bogofilter-digest-subscribe at aotto.com