randomtrain observation
David Relson
relson at osagesoftware.com
Sun May 18 23:31:21 CEST 2003
Greetings,
As y'all may know, bogofilter's contrib directory contains a script named
randomtrain. Given one or more mbox files of ham and one or more of spam,
it classifies each message. If bogofilter's classification is wrong,
bogofilter is trained on that message. This technique is known as "train
on error".
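The "train on error" loop described above can be sketched roughly as follows. This is only an illustration, not the randomtrain script itself: ToyFilter is a hypothetical stand-in with a trivial token-count rule (not bogofilter's algorithm), train_on_error is a made-up name, and the shuffling is an assumption based on the script's name.

```python
import random
from collections import Counter


class ToyFilter:
    """Hypothetical stand-in for a real filter; NOT bogofilter's algorithm."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}

    def classify(self, text):
        tokens = text.split()
        spam_score = sum(self.counts["spam"][t] for t in tokens)
        ham_score = sum(self.counts["ham"][t] for t in tokens)
        # Ties (including empty wordlists) fall back to "ham".
        return "spam" if spam_score > ham_score else "ham"

    def train(self, text, label):
        self.counts[label].update(text.split())


def train_on_error(messages, filt, seed=0):
    """Classify each (text, label) pair; train only when the filter is wrong."""
    order = list(messages)
    # Assumption: messages are visited in random order, as the script name suggests.
    random.Random(seed).shuffle(order)
    classified, trained = Counter(), Counter()
    for text, label in order:
        classified[label] += 1
        if filt.classify(text) != label:
            filt.train(text, label)
            trained[label] += 1
    return classified, trained


messages = [
    ("buy pills now", "spam"),
    ("buy pills cheap", "spam"),
    ("meeting at noon", "ham"),
]
classified, trained = train_on_error(messages, ToyFilter())
# Columns: spam classified, spam trained, ham classified, ham trained.
print(classified["spam"], trained["spam"], classified["ham"], trained["ham"])
# prints: 2 1 1 0
```

Whichever spam message comes first is misclassified and trained on; the second shares tokens with it and is then classified correctly, so only one of the two spam messages ends up in the wordlist.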
At the moment, I've got 14,484 spam and 34,156 ham which my mail server has
received since it started running bogofilter last October. I've got
randomtrain running with the whole shebang. As randomtrain runs, it prints
how many spam it has classified, how many it has trained with (because of a
classification error), how many ham it has classified, and how many it has
trained with.
I'm using a CVS snapshot, bogofilter-0.12.3.cvs, with the following
bogofilter.cf:
min_dev=0.35
ignore_case=no
header_line_markup=yes
tokenize_html_tags=yes
replace_nonascii_characters=yes
At the moment, the run is approx 75% complete, and the numbers so far are:
   spam     reg    good     reg
  10525    3099   24836      94
I've also used bogoutil and wc to print the number of tokens in each wordlist:
  spamlist   64195
  goodlist    9202
The training rate is approx 30% of spam and 0.4% of ham. The wordcounts are
approx 20 tokens per trained spam message and 100 per trained ham message.
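As a quick check, the ratios can be recomputed from the counts reported above. This sketch assumes the per-message wordcounts are the wordlist token counts divided by the number of messages trained on (not the number classified); the variable names are made up for illustration.

```python
# Counts reported in the post (run approx 75% complete).
spam_classified, spam_trained = 10525, 3099
ham_classified, ham_trained = 24836, 94
spam_tokens, ham_tokens = 64195, 9202

# Share of classified messages that needed training.
spam_rate = 100 * spam_trained / spam_classified
ham_rate = 100 * ham_trained / ham_classified

# Assumption: tokens per message means tokens per *trained* message.
spam_tokens_per_msg = spam_tokens / spam_trained
ham_tokens_per_msg = ham_tokens / ham_trained

print(f"spam training rate: {spam_rate:.1f}%")             # 29.4%
print(f"ham training rate:  {ham_rate:.2f}%")              # 0.38%
print(f"tokens per trained spam: {spam_tokens_per_msg:.1f}")  # 20.7
print(f"tokens per trained ham:  {ham_tokens_per_msg:.1f}")   # 97.9
```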
These numbers lead me to think that spam is much more varied in content
than is ham, hence bogofilter needs many more spam tokens than ham tokens
in order to classify messages correctly.
'Tis interesting that ham appears so much easier to classify correctly...
David