FAQ: How to train

Boris 'pi' Piwinger 3.14 at logic.univie.ac.at
Tue Jul 29 11:14:32 CEST 2003


Hi!

I think it is time for the FAQ to describe the different
training methods. Let me give it a try:

 Operational Questions

<h2 id="training">How do I start my bogofilter training?</h2>

<p>To classify messages as ham (non-spam) or spam, bogofilter
needs to learn from your mail. To start with, it is best to
have as large a collection as possible of messages you know
for sure are ham or spam (errors here will cause problems
later, so try hard;-). Warning: only use your own mail; if
you use other collections (like the spam corpora you find on
the web), bogofilter might draw wrong conclusions; after
all, you want it to understand <em>your</em> mail.</p>

<p>Then you have basically three choices. All of them work
better if your training base (the collections above) is
bigger; the smaller it is, the more errors you have to
expect in production. For the examples below, assume your
collection consists of the two mbox files ham and spam.</p>

<ul>
<li><p>Method 1) Train bogofilter with all your messages. In
our example:</p>
<pre>bogofilter -Ms &lt;spam
bogofilter -Mn &lt;ham</pre></li>

<li><p>Method 2) Use the script randomtrain (in the contrib
directory). This uses a train-on-error concept, i.e., only
those messages that bogofilter does not already classify
correctly are added to the database. The messages are
checked in random order. [When does this script stop? Is
each message checked only once?] This produces a much
smaller database than the previous method, but it seems to
work even better in production. In our example:</p>
<pre>randomtrain -s spam -n ham</pre></li>

<li><p>Method 3) Use the script bogominitrain.pl (in the
contrib directory). This also uses a train-on-error concept,
but here the messages are checked in the order of your
mboxes. You should use the -f option, which repeats the
process until all messages in your training collection are
classified correctly (you can even adjust the level of
certainty). Tests show that this generates the smallest
database of all the methods. And since the script makes sure
the database knows "everything" about your training
collection with a precision of your choice, it works very
well. In our example (with spam_cutoff=0.6 in your config
file):</p>
<pre>bogominitrain.pl -fv ~/.bogofilter ham spam '-o 0.7,0.5'</pre></li>
</ul>

<hr>

<h2 id="production">How do I keep it going?</h2>

<p>When you use bogofilter, it will make mistakes, so you
need to continue your training. There are two concepts here:
the first is to train with every incoming message (using the
-u option); the second is to train on error only. The first
only makes sense with method 1 described above.</p>
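<p>For the train-with-every-message approach, bogofilter is
typically run from your mail delivery setup. A minimal
procmail sketch (the exact X-Bogosity header value depends on
your bogofilter version, so check what it actually writes
before relying on the match, and adjust the spam folder name
to your setup):</p>
<pre>:0fw
| bogofilter -u -e -p

:0:
* ^X-Bogosity: (Yes|Spam)
spam</pre>
<p>Here -p passes the message through with an X-Bogosity
header added, -e makes bogofilter exit with status 0 for both
ham and spam (so procmail does not treat a spam verdict as a
filter failure), and -u updates the database with the
classification.</p>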

<p>Since you might want to rebuild your database at some
point (some changes in bogofilter suggest that), you should
continuously add new messages to your training collection.</p>

<p>Whatever you do, bogofilter will make mistakes, i.e.,
classify ham as spam (false positives) or spam as ham (false
negatives). You need to correct these errors. If you train
with every message, you first need to undo the wrong
classification (-S/-N); then you tell bogofilter the correct
answer by training with the mistakenly classified message
using -n/-s.</p>
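<p>In our example, assuming the misclassified message has
been saved to the file msg, the correction looks like this
(-S/-N undo the earlier registration, -s/-n register the
correct answer; combined flags like -Sn do both in one run):</p>
<pre>bogofilter -Sn &lt;msg   # false positive: was registered as spam, is really ham
bogofilter -Ns &lt;msg   # false negative: was registered as ham, is really spam</pre>
<p>If you train on error only (no -u), the message was never
registered in the first place, so plain -n or -s is enough.</p>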

<p>The smaller your database, the greater the risk that this
training has an adverse effect on the other side: when you
train with another spam message, this might make some ham
messages look more spammish, and vice versa.</p>

<p>If you use method 3 above, you can compensate for this
effect by repeating the training with your complete training
collection (don't forget to add the new messages to that
collection). This will add those messages which show the
adverse effect, on both sides, until you have a new
equilibrium.</p>
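<p>Such a full retraining pass might look like this (a
sketch, assuming your database lives in ~/.bogofilter and
your updated collections are still the mbox files ham and
spam):</p>
<pre>mv ~/.bogofilter ~/.bogofilter.bak   # keep the old database, just in case
mkdir ~/.bogofilter
bogominitrain.pl -fv ~/.bogofilter ham spam '-o 0.7,0.5'</pre>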

pi




