Training without ham.

Jesse Trucks jesse at cyberius.net
Mon Sep 8 05:19:29 CEST 2003


On Sun, 7 Sep 2003, David Relson wrote:

> On Sun, 7 Sep 2003 19:05:52 +0200 (CEST)
>
>
> BF compares the tokens to ham and spam lists and determines which one
> matches better.  If you only train on spam, the comparison becomes one
> of "known" words (which are all spam) and "unknown" words.  As the
> ham/spam comparison is lost, the results can't be good.
>
> I don't think it would work :-(

It generates a great many false positives for the first several days or weeks (the time being
dependent on the volume of mail coming into the server), so be prepared for a lot of manual
intervention. After staying on top of it for a couple of weeks, it works fine.

On a second implementation, I just cleansed a saved copy of my own HAM of anything sensitive and used
it to seed the ham database. When using similarly sized files for HAM and SPAM in this fashion,
things worked much better.
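
To illustrate why seeding the ham side helps, here is a toy sketch in Python. It is not
bogofilter's code: the tiny message lists, the function names, and the smoothing parameters
s and x are all assumptions made up for the example, and the scoring is only a Robinson-style
per-token estimate. It shows a common word scoring as spammy when only spam has been
registered, and dropping below neutral once a similar amount of ham is registered.

```python
from collections import Counter

def train(messages):
    """Count, per token, how many messages contain it."""
    counts = Counter()
    for msg in messages:
        counts.update(set(msg.lower().split()))
    return counts

def token_spam_prob(token, spam_counts, ham_counts, n_spam, n_ham, s=1.0, x=0.5):
    """Smoothed estimate of how spammy a token is (0 = ham, 1 = spam).

    s is the weight given to the prior and x is the score assumed for a
    token that has never been seen; both values are assumptions here.
    """
    n = spam_counts[token] + ham_counts[token]
    if n == 0:
        return x  # unknown token: no evidence either way
    spam_freq = spam_counts[token] / n_spam if n_spam else 0.0
    ham_freq = ham_counts[token] / n_ham if n_ham else 0.0
    p = spam_freq / (spam_freq + ham_freq)
    return (s * x + n * p) / (s + n)

# Hypothetical training data, two messages per class.
spam = ["buy cheap pills now", "cheap pills for you"]
ham = ["the meeting is moved for you", "notes for the meeting"]

spam_counts = train(spam)
no_ham = Counter()
ham_counts = train(ham)

# Trained on spam only: every known word can only count toward spam.
spam_only_for = token_spam_prob("for", spam_counts, no_ham, len(spam), 0)
spam_only_meeting = token_spam_prob("meeting", spam_counts, no_ham, len(spam), 0)

# Ham database seeded with a similar amount of mail.
seeded_for = token_spam_prob("for", spam_counts, ham_counts, len(spam), len(ham))
seeded_meeting = token_spam_prob("meeting", spam_counts, ham_counts, len(spam), len(ham))

print("spam only:", spam_only_for, spam_only_meeting)
print("seeded:   ", seeded_for, seeded_meeting)
```

In this sketch an ordinary word like "for" leans toward spam until ham is registered, which
matches the run of false positives described above. With spam-only training a typical ham
word such as "meeting" never does better than the neutral score, so it cannot offset the
words that lean toward spam.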

-- 
Jesse Trucks	   	       jesse at cyberius.net
Cyberius' Network	 http://www.cyberius.net/
GCUX - GIAC Certified Unix Security Administrator




