randomtrain observation

Sun May 18 23:31:21 CEST 2003

Greetings,

As y'all may know, bogofilter's contrib directory contains a script named 
randomtrain.  Given one or more mbox files of ham and one or more of spam, 
it classifies each message.  If bogofilter's classification is wrong, 
bogofilter is trained on that message.  This technique is known as "train 
on error".
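
For anyone who hasn't seen the technique, here is a minimal Python sketch of 
train on error. It is not randomtrain's code. ToyFilter is a made-up 
stand-in for bogofilter, train_on_error is a hypothetical helper, and the 
shuffle is my assumption based on the script's name.

```python
import random


class ToyFilter:
    """Hypothetical stand-in for a real filter: remembers tokens per class."""

    def __init__(self):
        self.spam_tokens = set()
        self.ham_tokens = set()

    def classify(self, text):
        # Call a message spam only if it shares more tokens with trained spam
        # than with trained ham; an untrained filter therefore says "ham".
        tokens = set(text.split())
        return len(tokens & self.spam_tokens) > len(tokens & self.ham_tokens)

    def train(self, text, is_spam):
        target = self.spam_tokens if is_spam else self.ham_tokens
        target.update(text.split())


def train_on_error(messages, filt, seed=0):
    """Classify each (text, is_spam) pair; train only on misclassified ones."""
    order = list(messages)
    random.Random(seed).shuffle(order)  # assumed: messages are visited in random order
    counts = {"spam": 0, "spam_trained": 0, "ham": 0, "ham_trained": 0}
    for text, is_spam in order:
        kind = "spam" if is_spam else "ham"
        counts[kind] += 1
        if filt.classify(text) != is_spam:
            filt.train(text, is_spam)
            counts[kind + "_trained"] += 1
    return counts


messages = [
    ("buy pills now", True),
    ("buy pills cheap", True),
    ("meeting at noon", False),
]
counts = train_on_error(messages, ToyFilter())
print(counts)
```

The four counters mirror the four numbers that randomtrain prints as it runs.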

At the moment, I've got 14,484 spam and 34,156 ham which my mail server has 
received since it started running bogofilter last October.  I've got 
randomtrain running with the whole shebang.  As randomtrain runs, it prints 
how many spam it has classified, how many it has trained with (because of a 
classification error), how many ham it has classified, and how many it has 
trained with.

I'm using a CVS snapshot, bogofilter-0.12.3.cvs, with the following 
bogofilter.cf:

min_dev=0.35
ignore_case=no
header_line_markup=yes
tokenize_html_tags=yes
replace_nonascii_characters=yes

At the moment, the run is approx 75% complete, and the numbers so far are:

 spam   reg   good  reg
10525  3099  24836   94

I've also used bogoutil and wc to print the number of tokens in each wordlist:

spamlist 64195
goodlist  9202

The training rate is approx 30% of spam and 0.4% of ham.  The wordlists hold 
approx 20 tokens per trained spam message and 100 per trained ham message.
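
The rates and per-message token counts can be rechecked from the figures 
above with a few lines of Python. This assumes the per-message figure means 
wordlist size divided by the number of messages trained, which is only a 
rough measure since each wordlist holds distinct tokens. The variable names 
are mine.

```python
# Counts reported by randomtrain partway through the run.
spam_seen, spam_trained = 10525, 3099
ham_seen, ham_trained = 24836, 94

# Wordlist sizes as counted with bogoutil and wc.
spam_tokens, ham_tokens = 64195, 9202

# Fraction of classified messages that needed training.
spam_rate = spam_trained / spam_seen
ham_rate = ham_trained / ham_seen

# Wordlist entries per trained message (assumed meaning of "per message").
spam_tokens_per_msg = spam_tokens / spam_trained
ham_tokens_per_msg = ham_tokens / ham_trained

print(f"training rate: spam {spam_rate:.1%}, ham {ham_rate:.1%}")
print(f"tokens per trained message: spam {spam_tokens_per_msg:.1f}, "
      f"ham {ham_tokens_per_msg:.1f}")
```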

These numbers lead me to think that spam is much more varied in content 
than is ham, hence bogofilter needs many more spam tokens than ham tokens 
in order to classify messages correctly.

'Tis interesting that ham appears so much easier to classify correctly...

David