Training without ham.
Jesse Trucks
jesse at cyberius.net
Mon Sep 8 05:19:29 CEST 2003
On Sun, 7 Sep 2003, David Relson wrote:
> On Sun, 7 Sep 2003 19:05:52 +0200 (CEST)
>
>
> BF compares the tokens to ham and spam lists and determines which one
> matches better. If you only train on spam, the comparison becomes one
> of "known" words (which are all spam) and "unknown" words. As the
> ham/spam comparison is lost, the results can't be good.
>
> I don't think it would work :-(
Training on spam alone generates a great many false positives for the first several days or weeks
(how long depends on the volume of mail coming into the server), so be prepared for a lot of manual
intervention. After staying on top of it for a couple of weeks, it works fine.
On a second implementation, I just stripped anything sensitive from a saved copy of my own ham and
used it to seed the ham database. With ham and spam seed files of similar size, things worked much
better.
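
To illustrate why the early false positives happen and why seeding ham helps, here is a rough
Python sketch. It is a toy per-token ratio averaged over the message, not bogofilter's actual
calculation, and every name and sample message in it is made up for the example. Unknown tokens
are assumed to score a neutral 0.5.

```python
from collections import Counter

def train(messages):
    """Count, for each token, how many messages contain it."""
    counts = Counter()
    for msg in messages:
        counts.update(set(msg.lower().split()))
    return counts

def token_spamicity(token, spam_counts, ham_counts, n_spam, n_ham, unknown=0.5):
    """Toy estimate of how spammy a token is; unseen tokens get `unknown`."""
    spam_rate = spam_counts[token] / n_spam if n_spam else 0.0
    ham_rate = ham_counts[token] / n_ham if n_ham else 0.0
    if spam_rate + ham_rate == 0:
        return unknown
    return spam_rate / (spam_rate + ham_rate)

def message_score(msg, spam_msgs, ham_msgs):
    """Average the token scores of a message (a simplification)."""
    spam_counts, ham_counts = train(spam_msgs), train(ham_msgs)
    tokens = set(msg.lower().split())
    scores = [
        token_spamicity(t, spam_counts, ham_counts, len(spam_msgs), len(ham_msgs))
        for t in tokens
    ]
    return sum(scores) / len(scores)

# Hypothetical training data, invented for this sketch.
spam = ["buy cheap pills now", "free offer for you now"]
ham = ["are you free for a meeting", "lunch now or later for you"]

legit = "are you free for lunch now"

# Spam-only training: every known word counts as pure spam.
spam_only = message_score(legit, spam, [])
# Seeded with a similar amount of ham: common words are pulled back down.
balanced = message_score(legit, spam, ham)

print(f"spam-only: {spam_only:.2f}  with ham: {balanced:.2f}")
```

With only spam in the database, ordinary words like "you" and "for" look fully spammy because
nothing argues against them, so a legitimate message scores high. Once ham of similar size is
seeded, the same words are offset and the score drops below the neutral point.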
--
Jesse Trucks jesse at cyberius.net
Cyberius' Network http://www.cyberius.net/
GCUX - GIAC Certified Unix Security Administrator