Accuracy is lacking

Tracy R Reed treed at ultraviolet.org
Thu Feb 13 21:44:26 CET 2003


I used bayespam months ago and was really happy with it, but it is written
in Perl, which caused a lot of startup overhead, and it did not integrate
well with qmail, so I switched to bogofilter. But I am having major
accuracy problems. I fed bogofilter my large spam folder and my large
saved-mail folder to generate the bad and good wordlists, and they now
have plenty of data:

-rw-------    1 alias    nofiles   5877760 Feb 13 12:32 goodlist.db
-rw-------    1 alias    nofiles   2506752 Feb 13 12:32 spamlist.db

I don't get any false positives, but it is missing a lot of spam. I would
say 3/4 of the spam makes it into my inbox and only 1/4 gets filtered.
Looking at the spamicity value in the header, the scores all fall very
near 0.5. With bayespam the values for spam were always very high and
those for non-spam very low, so there were very few edge cases. Here
everything seems to fall right on the line and most of it ends up in my
inbox. I always send the misclassified spam back through bogofilter to
correct the database, but it does not seem to be gaining me anything.

Normally, mail that is flagged as spam is filed by procmail into my spam
folder. Spam that ends up in my mailbox gets piped through bogofilter for
correction and then saved in the spam folder. Here are the headers of all
the misclassified spam I manually saved into my spam folder this week:

 No, tests=bogofilter, spamicity=0.480839, version=0.10.1.5
 No, tests=bogofilter, spamicity=0.511638, version=0.10.1.5
 No, tests=bogofilter, spamicity=0.481192, version=0.10.1.5
 No, tests=bogofilter, spamicity=0.381694, version=0.10.1.5
 No, tests=bogofilter, spamicity=0.399330, version=0.10.1.5
 No, tests=bogofilter, spamicity=0.491937, version=0.10.1.5
 No, tests=bogofilter, spamicity=0.431448, version=0.10.1.5
 No, tests=bogofilter, spamicity=0.520547, version=0.10.1.5
 No, tests=bogofilter, spamicity=0.480346, version=0.10.1.5
 No, tests=bogofilter, spamicity=0.442215, version=0.10.1.5
 No, tests=bogofilter, spamicity=0.467126, version=0.10.1.5
 No, tests=bogofilter, spamicity=0.475594, version=0.10.1.5
 No, tests=bogofilter, spamicity=0.354083, version=0.10.1.5
 No, tests=bogofilter, spamicity=0.415383, version=0.10.1.5
 No, tests=bogofilter, spamicity=0.487420, version=0.10.1.5
 No, tests=bogofilter, spamicity=0.486355, version=0.10.1.5
 No, tests=bogofilter, spamicity=0.536275, version=0.10.1.5
 No, tests=bogofilter, spamicity=0.456555, version=0.10.1.5
 No, tests=bogofilter, spamicity=0.416585, version=0.10.1.5
 No, tests=bogofilter, spamicity=0.518219, version=0.10.1.5
 No, tests=bogofilter, spamicity=0.494694, version=0.10.1.5
 No, tests=bogofilter, spamicity=0.493999, version=0.10.1.5
 No, tests=bogofilter, spamicity=0.472500, version=0.10.1.5
 No, tests=bogofilter, spamicity=0.503451, version=0.10.1.5
 No, tests=bogofilter, spamicity=0.369909, version=0.10.1.5
 No, tests=bogofilter, spamicity=0.525622, version=0.10.1.5
 No, tests=bogofilter, spamicity=0.459727, version=0.10.1.5
 No, tests=bogofilter, spamicity=0.520119, version=0.10.1.5

And here is a sample, about half, of the spam that was correctly flagged:

 Yes, tests=bogofilter, spamicity=0.583354, version=0.10.1.5
 Yes, tests=bogofilter, spamicity=0.702942, version=0.10.1.5
 Yes, tests=bogofilter, spamicity=0.810499, version=0.10.1.5
 Yes, tests=bogofilter, spamicity=0.814493, version=0.10.1.5
 Yes, tests=bogofilter, spamicity=0.606610, version=0.10.1.5
 Yes, tests=bogofilter, spamicity=0.697771, version=0.10.1.5
 Yes, tests=bogofilter, spamicity=0.697790, version=0.10.1.5
 Yes, tests=bogofilter, spamicity=0.580406, version=0.10.1.5
 Yes, tests=bogofilter, spamicity=0.769638, version=0.10.1.5
 Yes, tests=bogofilter, spamicity=0.646073, version=0.10.1.5
 Yes, tests=bogofilter, spamicity=0.623065, version=0.10.1.5
 Yes, tests=bogofilter, spamicity=0.610091, version=0.10.1.5
 Yes, tests=bogofilter, spamicity=0.679078, version=0.10.1.5
 Yes, tests=bogofilter, spamicity=0.893678, version=0.10.1.5
 Yes, tests=bogofilter, spamicity=0.793094, version=0.10.1.5
 Yes, tests=bogofilter, spamicity=0.746744, version=0.10.1.5
 Yes, tests=bogofilter, spamicity=0.600768, version=0.10.1.5
 Yes, tests=bogofilter, spamicity=0.607617, version=0.10.1.5
 Yes, tests=bogofilter, spamicity=0.559252, version=0.10.1.5
 Yes, tests=bogofilter, spamicity=0.632653, version=0.10.1.5
 Yes, tests=bogofilter, spamicity=0.702682, version=0.10.1.5
 Yes, tests=bogofilter, spamicity=0.576008, version=0.10.1.5
 Yes, tests=bogofilter, spamicity=0.786770, version=0.10.1.5
 Yes, tests=bogofilter, spamicity=0.597972, version=0.10.1.5
 Yes, tests=bogofilter, spamicity=0.625881, version=0.10.1.5
 Yes, tests=bogofilter, spamicity=0.739803, version=0.10.1.5
 Yes, tests=bogofilter, spamicity=0.559261, version=0.10.1.5
 Yes, tests=bogofilter, spamicity=0.627303, version=0.10.1.5
 Yes, tests=bogofilter, spamicity=0.788176, version=0.10.1.5

In some cases there is not much difference between the two groups. Any
suggestions?
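
To put numbers on how close the two groups are, here is a minimal Python
sketch that groups the scores by verdict and reports the minimum, maximum
and mean of each group. It assumes header lines in exactly the format
quoted above; the function name summarize and the four-line sample are
only illustrative, and the sample values are copied from the two lists. In
those lists the highest missed score is 0.536275 and the lowest caught
score is 0.559252, so the two groups are separated by a very narrow band.

```python
import re
from statistics import mean

# Matches lines such as:
#   " No, tests=bogofilter, spamicity=0.480839, version=0.10.1.5"
# The pattern is an assumption based only on the lines quoted in this message.
LINE_RE = re.compile(r"^\s*(Yes|No), tests=bogofilter, spamicity=([0-9.]+),")


def summarize(lines):
    """Group spamicity scores by Yes/No verdict; return (min, max, mean)."""
    scores = {"Yes": [], "No": []}
    for line in lines:
        m = LINE_RE.match(line)
        if m:  # silently skip anything that is not a score line
            scores[m.group(1)].append(float(m.group(2)))
    return {
        verdict: (min(vals), max(vals), mean(vals))
        for verdict, vals in scores.items()
        if vals
    }


# Illustrative sample: the lowest/highest lines from each list above.
sample = [
    " No, tests=bogofilter, spamicity=0.354083, version=0.10.1.5",
    " No, tests=bogofilter, spamicity=0.536275, version=0.10.1.5",
    " Yes, tests=bogofilter, spamicity=0.559252, version=0.10.1.5",
    " Yes, tests=bogofilter, spamicity=0.893678, version=0.10.1.5",
]

for verdict, (lo, hi, avg) in summarize(sample).items():
    print(f"{verdict}: min={lo:.6f} max={hi:.6f} mean={avg:.6f}")
```

Feeding it the full lists instead of the sample would show the same
picture in more detail.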

-- 
Tracy Reed      http://www.ultraviolet.org