bogominitrain

David Relson relson at osagesoftware.com
Sat Oct 15 05:16:27 CEST 2005


On Sat, 15 Oct 2005 05:15:53 +0400
Milan Jovanovic wrote:

> Does it make sense to train bogofilter with bogominitrain.pl on the same wordlist.db and the same spam and ham that you used the first time to create that wordlist.db with the -s and -n switches?

No, I don't think it makes sense.  When you initially create a wordlist
with a bunch of ham and spam, you're (probably) creating a wordlist
that's larger than it _needs_ to be.

Stated differently, if you create the wordlist using an optimal set of
spam and ham, you'll have all the words needed to do a very good job of
classification and you'll also have a small wordlist.  You can think of
this as using an ideal set of messages to create an ideal wordlist.

With bogominitrain.pl, each message is first classified against the
current wordlist.  Some will be classified correctly and some won't be.
If you train with only the incorrectly classified ones, you'll build a
wordlist that has the tokens needed to properly classify the ham and
spam for _your_ site.

Bogominitrain.pl automates this process, which is what makes it useful.
Generally speaking, it finds the "hard to classify" messages and adds
them to the wordlist.  Since only a subset of all the spam and ham is
used for training, the resulting wordlist will be smaller.  This can be
a nice thing.
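
To make the idea concrete, here is a rough sketch of a train-on-error
loop in Python.  It is not bogominitrain.pl's code and it does not use
bogofilter's scoring: ToyWordlist, train_on_error and the four sample
messages are all made up for illustration, and the "score" is just a
count of spam tokens minus ham tokens.

```python
from collections import Counter


def tokens(msg):
    """Split a message into a set of lowercase tokens (toy tokenizer)."""
    return set(msg.lower().split())


class ToyWordlist:
    """Hypothetical stand-in for wordlist.db: per-token spam and ham counts."""

    def __init__(self):
        self.spam = Counter()
        self.ham = Counter()

    def train(self, msg, is_spam):
        # Register every token of the message as spam or ham.
        (self.spam if is_spam else self.ham).update(tokens(msg))

    def classify(self, msg):
        """Return True for spam, False for ham, None when undecided."""
        # Assumption: a naive score, NOT bogofilter's statistical method.
        score = sum(self.spam[t] - self.ham[t] for t in tokens(msg))
        if score > 0:
            return True
        if score < 0:
            return False
        return None

    def size(self):
        # Number of distinct tokens stored in the wordlist.
        return len(set(self.spam) | set(self.ham))


def train_on_error(wordlist, corpus, max_passes=10):
    """Train only on messages the current wordlist gets wrong.

    Repeats over the corpus until a full pass produces no errors
    (or max_passes is reached).  Returns how many messages were trained.
    """
    trained = 0
    for _ in range(max_passes):
        errors = 0
        for msg, is_spam in corpus:
            if wordlist.classify(msg) != is_spam:
                wordlist.train(msg, is_spam)
                errors += 1
        trained += errors
        if errors == 0:
            break
    return trained


# Made-up sample corpus: (message text, is_spam)
corpus = [
    ("cheap pills online", True),
    ("meeting notes attached", False),
    ("cheap pills today", True),
    ("notes from the meeting", False),
]

minimal = ToyWordlist()
trained = train_on_error(minimal, corpus)

full = ToyWordlist()
for msg, is_spam in corpus:
    full.train(msg, is_spam)  # the "train on everything" approach

print(trained, minimal.size(), full.size())
```

In this toy corpus only the first spam and the first ham get trained;
the other two are already classified correctly by then, so their extra
tokens never enter the wordlist.  That is the sense in which training
on errors gives a smaller wordlist than training on everything.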

I suspect that bogominitrain's author, Boris "pi" Piwinger, will chime
in with more info about this tool's merits.

HTH,

David


