bogotune results

Greg Louis glouis at dynamicro.on.ca
Thu Mar 25 13:51:02 CET 2004


On 20040324 (Wed) at 2202:17 -0500, Tom Allison wrote:

> I'm working with the assumption that my archive of spam/ham has already 
> been tuned/trained to such an extent that they are all separated into 
> two distinct ranges and that they can be represented successfully with 
> distinct ham_cutoff (highest_ham+) and spam_cutoff (lowest_spam-) values.

Most people receive email that is not capable of such clean separation. 
For example, the last fp I got was a hotel-reservation confirmation
that included a section containing gratuitous marketing drivel.  At work,
where a proportion of our legitimate mail is concerned with selling our
products to interested customers, it's impossible to keep fp down
without allowing a certain amount of spam to get through.  Such a hope
as you expressed (and I would be delighted if bogofilter can do that
for you) would not be realistic in regard to most email populations.
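
To make the condition concrete, here is a minimal Python sketch of the
"highest_ham+ / lowest_spam-" idea.  It is not bogofilter code: the
function name, the margin parameter and the scores are all invented for
illustration.  It only shows that such a pair of cutoffs exists when
every ham scores below every spam, and that a single high-scoring ham
breaks it.

```python
def separating_cutoffs(ham_scores, spam_scores, margin=0.01):
    """Return (ham_cutoff, spam_cutoff) if the two score sets are
    cleanly separated, otherwise None.

    Hypothetical helper: 'margin' stands in for the '+' and '-'
    in highest_ham+ and lowest_spam-.
    """
    highest_ham = max(ham_scores)
    lowest_spam = min(spam_scores)
    ham_cutoff = highest_ham + margin
    spam_cutoff = lowest_spam - margin
    if ham_cutoff >= spam_cutoff:
        # The ranges touch or overlap, so no such pair of cutoffs exists.
        return None
    return ham_cutoff, spam_cutoff


# Made-up scores: a cleanly separated archive ...
clean = separating_cutoffs([0.01, 0.10, 0.30], [0.80, 0.95, 0.99])

# ... and the same archive plus one ham that scores like spam
# (think of a reservation confirmation padded with marketing text).
overlapping = separating_cutoffs([0.01, 0.10, 0.30, 0.92], [0.80, 0.95, 0.99])
```

With the first set of scores the helper returns a usable pair of
cutoffs; with the second it returns None, which is the situation most
mail populations are in.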

> Now that I have some 2800+ ham tokens and 2000+ spam tokens and ~2500 
> each of ham/spam emails, I should hope to avoid, with certainty, the 
> chance that a good email will score across the Unsure and all the way 
> into the Spam group and similarly with spam doing the same.

I will say it again: statistical filtering is all about likelihood and
never about certainty.  The only way to avoid all false positives with
certainty is to set the spam cutoff to 1.
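
A rough sketch of that tradeoff, in Python.  The scores are made up,
the function is hypothetical, and I am assuming a message counts as
spam when its score is at or above the spam cutoff and that no score
actually reaches 1; none of this is taken from bogofilter's source.

```python
def count_errors(ham_scores, spam_scores, spam_cutoff):
    """Return (false_positives, spam_not_flagged) for one spam cutoff.

    Assumption: score >= spam_cutoff means "classified as spam".
    """
    false_positives = sum(1 for s in ham_scores if s >= spam_cutoff)
    spam_not_flagged = sum(1 for s in spam_scores if s < spam_cutoff)
    return false_positives, spam_not_flagged


# Invented scores; one ham (0.97) looks a lot like spam.
ham = [0.02, 0.10, 0.45, 0.97]
spam = [0.60, 0.93, 0.975, 0.999]

results = {cutoff: count_errors(ham, spam, cutoff)
           for cutoff in (0.95, 0.98, 1.0)}
# results[0.95] == (1, 2)   one false positive, two spams not flagged
# results[0.98] == (0, 3)   no false positive, three spams not flagged
# results[1.0]  == (0, 4)   no false positive possible, nothing flagged
```

Raising the cutoff removes the false positive only by letting more
spam through, and at 1 the filter no longer flags anything.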

> I have yet to see ham do that, but spam does it fairly regularly (once a 
> week).

Well, I have three quarters of a million tokens, well over half a
million of which occur in spam only, and my training db was made from
23,471 spams and 21,660 nonspams.  I haven't had a false positive for
nine weeks now, at some 1500 nonspams a day, and I get a similar very
low number of spam classified as nonspam.  However, I do get around
0.5% of spams -- about two a day, since I get 400 to 500 spams daily --
turning up as unsure with the cutoffs I'm using at present.  By
contrast (although it's not a fair comparison, because bogofilter has
improved a lot since then), when I had 2500 messages each, I was
getting a false positive every couple of weeks and about 5% of spam
being delivered.  Fortunately, a lot less spam was being sent me back
then.
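
For anyone who wants to check the arithmetic, here it is as a few
lines of Python.  The variable names are mine; the numbers are the
ones quoted above, and the nine weeks are taken as 63 days at a flat
1500 nonspams a day, which is only approximate.

```python
# Size of the training set described above.
training_spam = 23471
training_nonspam = 21660
training_messages = training_spam + training_nonspam  # 45131

# About 0.5% of 400 to 500 spams a day end up as unsure.
unsure_rate = 0.005
unsure_per_day = tuple(n * unsure_rate for n in (400, 500))  # about 2 to 2.5

# Nonspams received since the last false positive:
# nine weeks at roughly 1500 a day.
nonspam_without_fp = 9 * 7 * 1500  # 94500
```

So "about two a day" unsure is 0.5% of the daily spam volume, and the
nine weeks without a false positive cover something over ninety
thousand nonspams.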

-- 
| G r e g  L o u i s         | gpg public key: 0x400B1AA86D9E3E64 |
|  http://www.bgl.nu/~glouis |   (on my website or any keyserver) |
|  http://wecanstopspam.org in signatures helps fight junk email. |



