bogotune results
Greg Louis
glouis at dynamicro.on.ca
Thu Mar 25 13:51:02 CET 2004
On 20040324 (Wed) at 2202:17 -0500, Tom Allison wrote:
> I'm working with the assumption that my archive of spam/ham has already
> been tuned/trained to such an extent that they are all separated into
> two distinct ranges and that they can be represented successfully with
> distinct ham_cutoff (highest_ham+) and spam_cutoff (lowest_spam-) values.
Most people receive email that is not capable of such clean separation.
For example, the last fp I got was a hotel-reservation confirmation
that included a section containing gratuitous marketing drivel. At work,
where a proportion of our legitimate mail is concerned with selling our
products to interested customers, it's impossible to keep fp down
without allowing a certain amount of spam to get through. Such a hope
as you expressed (and I would be delighted if bogofilter can do that
for you) would not be realistic in regard to most email populations.
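As a rough illustration of why clean separation is a property of the mail
rather than of the cutoffs, here is a minimal Python sketch. The function
names and the sample scores are invented for the example, and the >= / <=
comparison rules are an assumption of the sketch, not taken from
bogofilter's source.

```python
def separation(ham_scores, spam_scores):
    """Return (highest_ham, lowest_spam, separable) for two lists of scores."""
    highest_ham = max(ham_scores)
    lowest_spam = min(spam_scores)
    # Clean separation means every ham scores strictly below every spam.
    return highest_ham, lowest_spam, highest_ham < lowest_spam


def classify(score, ham_cutoff, spam_cutoff):
    """Three-way decision; the >= / <= rules are an assumption of this sketch."""
    if score >= spam_cutoff:
        return "spam"
    if score <= ham_cutoff:
        return "ham"
    return "unsure"


# Hypothetical scores: the 0.62 ham stands in for a legitimate message
# padded with marketing text, the 0.48 spam for one that looks ordinary.
ham = [0.01, 0.05, 0.62]
spam = [0.48, 0.97, 0.99]

highest_ham, lowest_spam, separable = separation(ham, spam)
print(highest_ham, lowest_spam, separable)

# With the two ranges overlapping, no pair of cutoffs sorts every message
# correctly; cutoffs placed around the overlap leave both odd messages unsure.
print(classify(0.62, ham_cutoff=0.45, spam_cutoff=0.65))
print(classify(0.48, ham_cutoff=0.45, spam_cutoff=0.65))
```

When highest_ham is not below lowest_spam, as in this made-up data, the
only choice left is which kind of error (or how much "unsure") to accept.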
> Now that I have some 2800+ ham tokens and 2000+ spam tokens and ~2500
> each of ham/spam emails, I should hope to avoid, with certainty, the
> chance that a good email will score across the Unsure and all the way
> into the Spam group and similarly with spam doing the same.
I will say it again: statistical filtering is all about likelihood and
never about certainty. The only way to avoid all false positives with
certainty is to set the spam cutoff to 1.
> I have yet to see ham do that, but spam does it fairly regularly (once a
> week).
Well, I have three quarters of a million tokens, well over half a
million of which occur in spam only, and my training db was made from
23,471 spams and 21,660 nonspams. I haven't had a false positive for
nine weeks now, at some 1500 nonspams a day, and I get a similar very
low number of spam classified as nonspam. However, I do get around
0.5% of spams -- about two a day, since I get 400 to 500 spams daily --
turning up as unsure with the cutoffs I'm using at present. By
contrast (although it's not a fair comparison, because bogofilter has
improved a lot since then), when I had 2500 messages each, I was
getting a false positive every couple of weeks and about 5% of spam
being delivered. Fortunately, a lot less spam was being sent me back
then.
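The arithmetic behind those figures can be checked with a few lines of
Python. This is only a sketch: the variable names are made up, and the
inputs are the approximate numbers quoted above, so the results are
approximate too.

```python
# Approximate figures quoted in the message above.
unsure_rate = 0.005            # about 0.5% of spam ends up as unsure
spams_per_day = (400, 500)     # daily spam volume, low and high estimate
nonspams_per_day = 1500
weeks_without_fp = 9

# Unsure spams per day at the low and high volume estimates.
unsure_per_day = [round(unsure_rate * n, 1) for n in spams_per_day]

# Nonspams handled during the nine weeks with no false positive.
ham_without_fp = nonspams_per_day * 7 * weeks_without_fp

print(unsure_per_day)    # [2.0, 2.5]
print(ham_without_fp)    # 94500
```

So "about two a day" matches the 0.5% figure, and the nine-week run
covers on the order of 94,500 nonspam messages.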
--
| G r e g L o u i s | gpg public key: 0x400B1AA86D9E3E64 |
| http://www.bgl.nu/~glouis | (on my website or any keyserver) |
| http://wecanstopspam.org in signatures helps fight junk email. |