bogotune results

Thu Mar 25 04:18:02 CET 2004

On Wed, 24 Mar 2004 22:02:17 -0500
Tom Allison wrote:

> I'm working with the assumption that my archive of spam/ham has
> already been tuned/trained to such an extent that they are all
> separated into two distinct ranges and that they can be represented
> successfully with distinct ham_cutoff (highest_ham+) and spam_cutoff
> (lowest_spam-) values.
> 
> Obviously anything in the future can cross these parameters, but we're
> trustful that with sufficient number of tokens, the probability of
> this happening is increasingly small.
> 
> Now that I have some 2800+ ham tokens and 2000+ spam tokens and ~2500 
> each of ham/spam emails, I should hope to avoid, with certainty, the 
> chance that a good email will score across the Unsure and all the way 
> into the Spam group and similarly with spam doing the same.

Incoming email is a complex mix of tokens/features.  For example most of
my ham is text/plain (I think) while most of the spam includes text/html
(I think).  Several lists I'm on don't follow these rules.  Most weeks I
have an Unsure or two from Office Depot or Staples or ...  I faithfully
use them to train bogofilter but I suspect that cross-over messages
(html-ham) like these will always be a problem.  Since bogofilter is
catching 99% of my spam, these anomalies don't bother me.

I expect that I'll always have unsures and that the highest ham scores
will be above the lowest spam scores and vice versa.  I don't think
this will change, though I wouldn't mind if it does.
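
To make the overlap concrete, here is a minimal sketch in Python.  It is
not bogofilter's code: the function names and the example scores and
cutoffs are made up for illustration, and the comparison rule (spam at or
above spam_cutoff, ham at or below ham_cutoff, Unsure in between) is my
reading of how the two cutoffs are meant to work.

```python
def classify(score, ham_cutoff, spam_cutoff):
    """Three-way classification of a score in [0, 1] (assumed rule)."""
    if score >= spam_cutoff:
        return "Spam"
    if score <= ham_cutoff:
        return "Ham"
    return "Unsure"


def ranges_overlap(ham_scores, spam_scores):
    """True when the highest ham score reaches the lowest spam score,
    i.e. no single threshold separates the two sets cleanly."""
    return max(ham_scores) >= min(spam_scores)


# Hypothetical scores: one html-ham message scores high, one spam scores low.
ham_scores = [0.01, 0.05, 0.62]
spam_scores = [0.48, 0.97, 0.99]

# Hypothetical cutoffs chosen for this example only.
HAM_CUTOFF = 0.45
SPAM_CUTOFF = 0.99

overlap = ranges_overlap(ham_scores, spam_scores)
ham_labels = [classify(s, HAM_CUTOFF, SPAM_CUTOFF) for s in ham_scores]
spam_labels = [classify(s, HAM_CUTOFF, SPAM_CUTOFF) for s in spam_scores]
```

With these numbers the two ranges overlap, so the cross-over messages
land in Unsure rather than in the wrong group, which is the behaviour I
described above.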