garbage removal and 'outsiders noise'

Wed Apr 16 19:22:57 CEST 2003

At 12:24 PM 4/16/03, Alejandro Dau wrote:

>Hello,
>   I have noticed that ignoring the words in the databases with count 1
>gives fewer false negatives than using them. Is it just my test
>environment? Can you test that on your mail and post your results? Also,
>the resulting db will be much smaller than the original.

Greetings Alejandro,

Welcome to the bogofilter mailing list.  We enjoy newcomers, especially ones
with new ideas and the skills to implement them and contribute the code.

>Here is a sample test on 148 messages (spam & ham), not used for training 
>the bases:
>
>i) For the complete training db:
>size of goodlist.db: 212992  (6329 words in 34 msg)
>size of spamlist.db: 458752 (14354 words in 144 msg)
>messages detected as spam: 85
>
>ii) For the 'trimmed down' db:
>size of good db: 40960  (821 words in 34 msg)
>size of spam db: 90112 (2474 words in 144 msg)
>messages detected as spam: 94
>
>Test (i) detected 12 spam messages that test (ii) didn't detect.
>Test (ii) detected 21 spam messages that test (i) didn't detect.
>No detection was false positive.
>
>I think that it may be useful to have a bogofilter option to 'ignore words
>in database with counts less than n'. David, I may do the patch if you like.

Our algorithm expert, Greg Louis, has done a variety of tests.  In his
tests, deleting hapaxes (the term for one-occurrence tokens) gives poorer
results than keeping them.  Of course, spam corpora vary, so your results
may differ.
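
For anyone who wants to try the comparison on their own mail, here is a
minimal sketch of the "ignore words with counts less than n" idea.  It is
not bogofilter code and does not read bogofilter's database files: a plain
Python dict stands in for a word list, and the function name, the min_count
parameter, and the sample tokens are all made up for illustration.

```python
from typing import Dict


def prune_rare_tokens(counts: Dict[str, int], min_count: int = 2) -> Dict[str, int]:
    """Return a copy of `counts` without tokens seen fewer than `min_count` times.

    With min_count=2 this drops the hapaxes (tokens that occur exactly once).
    """
    return {token: n for token, n in counts.items() if n >= min_count}


# Toy word list, invented for this example (token -> occurrence count).
spam_counts = {"viagra": 7, "free": 12, "xqzt93": 1, "unsubscribe": 3, "hello": 1}

trimmed = prune_rare_tokens(spam_counts, min_count=2)
print(sorted(trimmed))
```

Scoring the same held-out messages once against the full word lists and
once against the trimmed ones, as Alejandro did, is what shows whether
pruning helps or hurts on a given corpus.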

He's also done some testing with varying values of the parameters used in
the Fisher algorithm, specifically the values of min_dev, robs, and
spam_cutoff, to see how they affect bogofilter's accuracy.  Take a look at
his bogofilter website, www.bgl.nu/~glouis/bogofilter, for his
findings.  His most recent test, "Bogofilter parameters(continued)", shows
that using different parameters can have a major effect in making
bogofilter more accurate.

I ran a series of tests using my mail.  I trained bogofilter with 6,173 
spam and 18,784 ham and then scored 4,317 spam and 9,567 ham.  For each set 
of parameters tested, spam_cutoff was chosen to give approximately 0.2% false
positives.  The number of false negatives varied from a high of 290 to a 
low of 60.
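
The procedure above (fix the false positive rate, then compare false
negatives) can be sketched roughly as follows.  This is only an
illustration, not the script used for the tests: it assumes each message
has already been given a spamicity score between 0 and 1, that a message
counts as spam when its score is at or above the cutoff, and the function
names and sample scores are invented.  For scale, 0.2% of 9,567 ham is
about 19 messages.

```python
import math
from typing import Sequence


def cutoff_for_fp_rate(ham_scores: Sequence[float], target_fp_rate: float) -> float:
    """Lowest cutoff whose false positive rate on `ham_scores` is within target.

    A ham message is a false positive when its score is >= the cutoff.
    """
    allowed = math.floor(target_fp_rate * len(ham_scores))
    # Raising the cutoff can only reduce false positives, so the first
    # candidate that fits the budget is the lowest acceptable cutoff.
    for candidate in sorted(set(ham_scores)):
        false_positives = sum(1 for s in ham_scores if s >= candidate)
        if false_positives <= allowed:
            return candidate
    return math.inf  # no ham may be flagged: nothing is classified as spam


def count_false_negatives(spam_scores: Sequence[float], cutoff: float) -> int:
    """Number of spam messages scoring below the cutoff (missed spam)."""
    return sum(1 for s in spam_scores if s < cutoff)


# Invented scores for ten ham and five spam messages.
ham_scores = [0.01, 0.02, 0.05, 0.10, 0.20, 0.30, 0.40, 0.60, 0.90, 0.97]
spam_scores = [0.99, 0.98, 0.95, 0.50, 0.999]

cutoff = cutoff_for_fp_rate(ham_scores, target_fp_rate=0.15)
missed = count_false_negatives(spam_scores, cutoff)
print(cutoff, missed)
```

Repeating this for each parameter set keeps the false positive rate the
same across runs, so the false negative counts can be compared directly.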

Conclusion: using a site's email to determine the best parameters for
bogofilter can have a _big_ effect.

Corollary: do a thorough test of algorithmic/parametric changes to 
determine whether they are helpful or harmful.

David