garbage removal and 'outsiders noise'
Alejandro Dau
adau at datamarkets.com.ar
Wed Apr 16 18:24:06 CEST 2003
Hello,
I have noticed that ignoring the words in the databases with count 1 gives fewer false
negatives than using them. Is it just my test environment? Can you test that on your mail
and post your results? Also, the resulting db will be much smaller than the original.
Here is a sample test on 148 messages (spam & ham) that were not used for training the databases:
i) For the complete training db:
size of goodlist.db: 212992 (6329 words in 34 msg)
size of spamlist.db: 458752 (14354 words in 144 msg)
messages detected as spam: 85
ii) For the 'trimmed down' db:
size of good db: 40960 (821 words in 34 msg)
size of spam db: 90112 (2474 words in 144 msg)
messages detected as spam: 94
Test (i) detected 12 spam messages that test (ii) didn't detect.
Test (ii) detected 21 spam messages that test (i) didn't detect.
No detection was a false positive.
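
As a quick consistency check on these figures, the overlap between the two tests can be worked out from the numbers above. This is only arithmetic on the reported counts; the variable names are made up for illustration.

```shell
# Reported counts from tests (i) and (ii)
spam_full=85    # detected as spam with the complete db
spam_trim=94    # detected as spam with the trimmed db
only_full=12    # detected by (i) but not by (ii)
only_trim=21    # detected by (ii) but not by (i)

# Messages detected by both tests, computed from each side
both_from_full=$((spam_full - only_full))
both_from_trim=$((spam_trim - only_trim))

# Messages detected by at least one of the two tests
either=$((both_from_full + only_full + only_trim))

echo "detected by both: $both_from_full (cross-check: $both_from_trim)"
echo "detected by at least one: $either"
```

Both ways of computing the overlap give the same value, so the four reported figures agree with each other.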
I think it would be useful to have a bogofilter option to 'ignore words in the database with
counts less than n'. David, I can write the patch if you like.
Best regards
Alejandro
PS: To make a trimmed-down db for the tests, you can do:
bogoutil -d /tmp/complete/goodlist.db | bogoutil -l /tmp/trimmed/goodlist.db.new -c 1
mv /tmp/trimmed/goodlist.db.new /tmp/trimmed/goodlist.db
bogoutil -d /tmp/complete/spamlist.db | bogoutil -l /tmp/trimmed/spamlist.db.new -c 1
mv /tmp/trimmed/spamlist.db.new /tmp/trimmed/spamlist.db
Then invoke bogofilter with either -d /tmp/complete or -d /tmp/trimmed.
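
To try thresholds other than 1 without a patch, a small filter on the dumped word list might be enough. This is a rough sketch, not part of bogofilter: trim_by_count is a hypothetical helper, and it assumes each dumped line is a token followed by its count as the second whitespace-separated field. It drops words with counts less than n, as in the proposed option.

```shell
# Hypothetical helper: keep only dump lines whose count (assumed to be
# field 2) is at least the minimum given as the first argument.
trim_by_count() {
    awk -v min="$1" '($2 + 0) >= min'
}

# Example on made-up dump lines (token, count, date):
printf '%s\n' 'viagra 7 20030416' 'hello 1 20030416' 'meeting 2 20030416' |
    trim_by_count 2
```

If the dump format matches that assumption, the filter could sit between the dump and the load, for example bogoutil -d /tmp/complete/goodlist.db | trim_by_count 2 | bogoutil -l /tmp/trimmed/goodlist.db.new, but I have not tested that combination.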