multi-word token results
relson at osagesoftware.com
Tue Aug 8 22:52:45 EDT 2006
Recently several new capabilities have been added to the cvs
version of bogofilter. They provide support for the following options:
  --min-token-len=N          min len for single tokens
  --max-token-len=N          max len for single tokens
  --max-multi-token-len=N    max len for multi-word tokens
  --multi-token-count=N      number of tokens per multi-word token
where N represents a number. The multi-token-count option allows
message scoring (spam vs ham) to be based on word pairs, triples,
quads, etc. The first 3 options are (more or less) present to support
the multi-token-count option.
Using "--multi-token-count=1" will produce the tokens you've seen for
years. Using "--multi-token-count=2" will produce tokens like
"first*second" and "part-1*part-2". Using "--multi-token-count=3" will
produce "first*second*third", "part-1*part-2*part-3", etc.
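The token shapes described above can be illustrated with a small Python sketch. This is not bogofilter's own code (bogofilter is written in C); the function name multi_word_tokens and the sliding-window approach are assumptions made for illustration, and the sketch ignores the three length options. Whether bogofilter also emits the shorter tokens alongside the joined ones is not stated here, so the sketch emits only the joined form.

```python
def multi_word_tokens(tokens, count, sep="*"):
    """Join each run of `count` consecutive tokens with `sep`.

    Hypothetical helper for illustration only; not bogofilter's implementation.
    """
    if count < 1:
        raise ValueError("count must be at least 1")
    # Slide a window of `count` tokens across the list, one step at a time.
    return [sep.join(tokens[i:i + count])
            for i in range(len(tokens) - count + 1)]


words = ["first", "second", "third", "fourth"]
print(multi_word_tokens(words, 1))  # the plain single tokens
print(multi_word_tokens(words, 2))  # pairs such as "first*second"
print(multi_word_tokens(words, 3))  # triples such as "first*second*third"
```

With count=1 the input comes back unchanged, which matches the statement that "--multi-token-count=1" produces the familiar tokens.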
Naturally, a question to ask is whether these options are of value. To
help answer that question, I created several wordlists (with different
values for multi-token-count) and scored some messages.
Here are details and numbers:
Over approx 4 yrs of bogofilter usage, I've accumulated a corpus of
235404 ham and 461881 spam. All these messages have been registered
into 3 separate wordlists. The first wordlist uses bogofilter's normal
parsing and tokenizing. The second wordlist uses word pairs (as
generated by bogofilter 1.1.0.cvs with the "--multi-token-count=2"
option). The third wordlist uses the "--multi-token-count=3" option.
Here are the database sizes (number of tokens and size of wordlist):
mtc       tokens     size
  1    5,859,086   158 Mb
  2   28,449,518   873 Mb
  3   62,606,888   2.1 Gb
As you can see, increasing the value of multi-token-count caused a
significant increase in token count and wordlist size.
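To put a number on that increase, here is a short Python calculation of the token-count growth relative to mtc=1. The counts are the ones listed above; the dictionary and function names are only illustrative.

```python
# Token counts per multi-token-count value, as listed above.
token_counts = {1: 5_859_086, 2: 28_449_518, 3: 62_606_888}


def growth_factors(counts, baseline=1):
    """Return each token count divided by the baseline wordlist's count."""
    base = counts[baseline]
    return {mtc: n / base for mtc, n in counts.items()}


for mtc, factor in growth_factors(token_counts).items():
    print(f"mtc={mtc}: {factor:.1f}x the tokens of mtc=1")
```

This works out to roughly 4.9x the tokens for pairs and 10.7x for triples.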
Having created the wordlists, I scored all the messages and tabulated
the results:
mtc      hH     hU   hS    sH     sU      sS
       corr    uns   FP    FN    uns    corr
  1  235103    301    0   287   7907  453687
  2  235133    271    0   160   7079  454642
  3  235204    199    1    69   4221  457591
The headings indicate my manual classification ("h" or "s") and
bogofilter's classification ("H", "U", or "S").
Column 1 has the value of multi-token-count.
Column 2 shows ham scored as ham, i.e. correct.
Column 3 shows ham scored as unsure.
Column 4 shows ham scored as spam, i.e. false positives.
Column 5 shows spam scored as ham, i.e. false negatives.
Column 6 shows spam scored as unsure.
Column 7 shows spam scored as spam, i.e. correct.
As can be seen, most messages are correctly classified (columns hH and
sS). Some are incorrectly classified (hS, i.e. false positives, and
sH, i.e. false negatives). The remainder are classified as "unsure".
Also noticeable is that higher values of mtc produce more correct
results, fewer unsures, and fewer false negatives.
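For anyone who prefers rates to raw counts, the following Python sketch turns the table rows into percentages. The numbers are copied from the table above; the summarize function and its key names are assumptions chosen for illustration and are not part of bogofilter.

```python
# mtc: (hH, hU, hS, sH, sU, sS), copied from the results table.
results = {
    1: (235103, 301, 0, 287, 7907, 453687),
    2: (235133, 271, 0, 160, 7079, 454642),
    3: (235204, 199, 1, 69, 4221, 457591),
}


def summarize(row):
    """Compute overall and per-class rates for one row of the table."""
    hH, hU, hS, sH, sU, sS = row
    total = sum(row)
    return {
        "correct": (hH + sS) / total,           # ham as ham plus spam as spam
        "unsure": (hU + sU) / total,            # anything scored unsure
        "false_pos": hS / (hH + hU + hS),       # share of ham scored as spam
        "false_neg": sH / (sH + sU + sS),       # share of spam scored as ham
    }


for mtc, row in results.items():
    s = summarize(row)
    print(f"mtc={mtc}: correct {s['correct']:.2%}, unsure {s['unsure']:.2%}, "
          f"FN {s['false_neg']:.3%}")
```

Each row sums to the full corpus of 235404 ham plus 461881 spam, so the rates are directly comparable across the three wordlists.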
The one exception to "mtc is always more accurate" is the 1 FP for
mtc=3. The message in question was sent on Nov 28, 2003 to the
SpamBayes mailing list. Its subject is "[Spambayes] A SPAM Message
with a 0% Score". As can be deduced from the subject, the message
contains the full content of a spam message. As the message is
_about_ spam, I rated it as "ham". Bogofilter with mtc=3 considered
the spam portion to be more important, hence classified it as spam.
The increase in multi-token-count value also significantly increases
bogofilter's memory footprint (when scoring large messages or
registering large numbers of messages) and significantly increases
bogofilter's processing time.
Bogofilter's default behavior will remain unchanged to maintain speed
and conserve wordlist space. The new capabilities have been
implemented for those who want them, because I thought the
implementation would be interesting, because I've wanted to implement
the features for some time, and because I've promised the features to
the bogofilter community.
I'll build the final files and release these new capabilities in
bogofilter 1.1.0 in the next day or so.