garbage removal and 'outsiders noise'

Alejandro Dau adau at datamarkets.com.ar
Wed Apr 16 18:24:06 CEST 2003


Hello,
  I have noticed that ignoring the words in the databases with a count of 1 gives fewer false 
negatives than using them. Is it just my test environment? Can you test that on your mail 
and post your results? Also, the resulting db will be much smaller than the original. 

Here is a sample test on 148 messages (spam & ham), none of them used for training the databases:

i) For the complete training db:
size of goodlist.db: 212992  (6329 words in 34 msg) 
size of spamlist.db: 458752 (14354 words in 144 msg)
messages detected as spam: 85  

ii) For the 'trimmed down' db:
size of good db: 40960  (821 words in 34 msg) 
size of spam db: 90112 (2474 words in 144 msg)
messages detected as spam: 94

Test (i) detected 12 spam messages that test (ii) didn't detect.
Test (ii) detected 21 spam messages that test (i) didn't detect.
No detection was a false positive.
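
As a quick consistency check on the figures above, the number of spam messages caught by both tests can be derived from either side and should match. The sketch below only redoes that arithmetic with the numbers quoted in this message; the variable names are mine and nothing here touches bogofilter itself.

```shell
#!/bin/sh
# Figures quoted from tests (i) and (ii) above.
spam_full=85       # detected as spam with the complete db, test (i)
spam_trimmed=94    # detected as spam with the trimmed db, test (ii)
only_full=12       # detected by (i) but not by (ii)
only_trimmed=21    # detected by (ii) but not by (i)

# Messages detected by both tests, computed from each side.
both_from_full=$((spam_full - only_full))
both_from_trimmed=$((spam_trimmed - only_trimmed))

# Messages detected by at least one of the two tests.
either=$((both_from_full + only_full + only_trimmed))

echo "detected by both (from i):  $both_from_full"
echo "detected by both (from ii): $both_from_trimmed"
echo "detected by at least one:   $either"
```

Both derivations give 73, so the counts are consistent, and 106 messages were flagged by at least one of the two databases.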

I think that it may be useful to have a bogofilter option to 'ignore words in the database with 
counts less than n'. David, I can do the patch if you like.

Best regards
Alejandro

PS: To make a trimmed-down db for the tests you can do:

bogoutil -d /tmp/complete/goodlist.db |  bogoutil -l /tmp/trimmed/goodlist.db.new -c 1 
mv /tmp/trimmed/goodlist.db.new /tmp/trimmed/goodlist.db  
bogoutil -d /tmp/complete/spamlist.db | bogoutil -l /tmp/trimmed/spamlist.db.new -c 1 
mv /tmp/trimmed/spamlist.db.new /tmp/trimmed/spamlist.db 

And then invoke bogofilter with the option -d /tmp/complete or -d /tmp/trimmed
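
If you want to see what the trimming does to a word list without touching a real database, here is a small sketch that works on a plain text dump. It assumes each dump line is "word count" separated by whitespace; the sample words, the temporary file and the MIN_COUNT variable are made up for illustration and are not bogoutil options or real data.

```shell
#!/bin/sh
# Hypothetical sample standing in for a text dump of a word list.
# Assumed line format: word, whitespace, count.
dump=$(mktemp)
cat > "$dump" <<'EOF'
viagra 17
hello 1
meeting 4
xyzzy 1
free 9
EOF

MIN_COUNT=2   # keep words seen at least this many times (illustrative name)

# Keep only the lines whose second field (the count) reaches the threshold.
trimmed=$(awk -v min="$MIN_COUNT" '$2 >= min' "$dump")

kept=$(printf '%s\n' "$trimmed" | wc -l | tr -d ' ')
total_words=$(wc -l < "$dump" | tr -d ' ')

echo "$trimmed"
echo "kept $kept of $total_words words"
rm -f "$dump"
```

With MIN_COUNT=2 this drops the words whose count is 1, which is the same idea as the trimmed db described above; raising it corresponds to a larger n in the proposed option.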




