bogotune results
David Relson
relson at osagesoftware.com
Thu Mar 25 04:18:02 CET 2004
On Wed, 24 Mar 2004 22:02:17 -0500
Tom Allison wrote:
> I'm working with the assumption that my archive of spam/ham has
> already been tuned/trained to such an extent that they are all
> separated into two distinct ranges and that they can be represented
> successfully with distinct ham_cutoff (highest_ham+) and spam_cutoff
> (lowest_spam-) values.
>
> Obviously anything in the future can cross these parameters, but we're
> trustful that with sufficient number of tokens, the probability of
> this happening is increasingly small.
>
> Now that I have some 2800+ ham tokens and 2000+ spam tokens and ~2500
> each of ham/spam emails, I should hope to avoid, with certainty, the
> chance that a good email will score across the Unsure and all the way
> into the Spam group and similarly with spam doing the same.
Incoming email is a complex mix of tokens/features. For example, most of
my ham is text/plain (I think) while most of the spam includes text/html
(I think). Several lists I'm on don't follow these rules. Most weeks I
have an Unsure or two from Office Depot or Staples or ... I faithfully
use them to train bogofilter but I suspect that cross-over messages
(html-ham) like these will always be a problem. Since bogofilter is
catching 99% of my spam, these anomalies don't bother me.
I expect that I'll always have unsures and that the highest ham scores
will be above the lowest spam scores and vice versa. I don't think
this will change, though I wouldn't mind if it does.
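To make the overlap concrete, here is a minimal Python sketch, not bogofilter's own code. The scores are made up for illustration, the function names are hypothetical, and the comparison rule (spam at or above spam_cutoff, ham at or below ham_cutoff, Unsure in between) is an assumption about how the two cutoffs are applied.

```python
def archive_extremes(ham_scores, spam_scores):
    """Return (highest_ham, lowest_spam, separable) for two lists of scores."""
    highest_ham = max(ham_scores)
    lowest_spam = min(spam_scores)
    # The two groups form distinct ranges only if every ham score
    # lies strictly below every spam score.
    separable = highest_ham < lowest_spam
    return highest_ham, lowest_spam, separable


def classify(score, ham_cutoff, spam_cutoff):
    """Three-way classification (assumed rule, see the note above)."""
    if score >= spam_cutoff:
        return "Spam"
    if score <= ham_cutoff:
        return "Ham"
    return "Unsure"


# Made-up scores: one html-ham message scores high, one spam scores low.
ham_scores = [0.01, 0.05, 0.62]
spam_scores = [0.55, 0.97, 0.99]

highest_ham, lowest_spam, separable = archive_extremes(ham_scores, spam_scores)
print(highest_ham, lowest_spam, separable)

# Illustrative cutoffs only; they are not recommended values.
ham_cutoff, spam_cutoff = 0.45, 0.90
for score in ham_scores + spam_scores:
    print(score, classify(score, ham_cutoff, spam_cutoff))
```

With scores like these, the highest ham sits above the lowest spam, so no pair of cutoffs separates the two groups cleanly. The cross-over messages land in Unsure rather than in the wrong class.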