multi-word token results

Wed Aug 9 04:52:45 CEST 2006

Greetings,

Several new capabilities have recently been added to the CVS
version of bogofilter.  They provide support for the
following options:

      --min-token-len=N               min len for single tokens
      --max-token-len=N               max len for single tokens
      --max-multi-token-len=N         max len for multi-word tokens
      --multi-token-count=N           number of tokens per multi-word token

where N represents a number.  The multi-token-count option allows
message scoring (spam vs ham) to be based on word pairs, triples,
quads, etc.  The first 3 options are (more or less) present to support
the multi-token-count option.

Using "--multi-token-count=1" will produce the tokens you've seen for
years.  Using "--multi-token-count=2" will produce tokens like
"first*second" and "part-1*part-2".  Using "--multi-token-count=3" will
produce "first*second*third", "part-1*part-2*part-3", etc.
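
To make the token shapes above concrete, here is a minimal Python sketch
of a sliding window that joins consecutive tokens with "*".  It is only
an illustration, not bogofilter's actual lexer (which is written in C);
the function name multi_word_tokens is hypothetical, and the sketch
assumes only full-length groups are emitted, which this post does not
confirm.

```python
def multi_word_tokens(tokens, count, sep="*"):
    """Join each run of `count` consecutive tokens with `sep`.

    Hypothetical helper for illustration only; it assumes that only
    groups of exactly `count` tokens are produced.
    """
    if count < 1:
        raise ValueError("count must be at least 1")
    # Slide a window of width `count` across the token list.
    return [sep.join(tokens[i:i + count])
            for i in range(len(tokens) - count + 1)]


words = ["first", "second", "third", "fourth"]
for n in (1, 2, 3):
    print(n, multi_word_tokens(words, n))
```

With count=1 the window is a single token wide, so the output is the
familiar single-word token list.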

Naturally a question to ask is whether these options are of value.  To
help answer that question, I created several wordlists (with different
values for multi-token-count) and scored some messages.

Here are details and numbers:

Over approx 4 yrs of bogofilter usage, I've accumulated a corpus of
235404 ham and 461881 spam.  All these messages have been registered
into 3 separate wordlists.  The first wordlist uses bogofilter's normal
parsing and tokenizing.  The second wordlist uses word pairs (as
generated by bogofilter 1.1.0.cvs with the "--multi-token-count=2"
option).  The third wordlist uses the "--multi-token-count=3" option.

Here are the database sizes (number of tokens and size of
wordlist.db):

 mtc        tokens      size
   1     5,859,086    158 MB
   2    28,449,518    873 MB
   3    62,606,888    2.1 GB

As you can see, increasing the value of multi-token-count caused a
significant increase in token count and wordlist size.  
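
For a rough sense of scale, the growth relative to the default wordlist
can be computed from the figures above.  This is a small Python sketch;
it assumes 2.1 gigabytes can be treated as 2100 megabytes, since the
exact unit convention is not stated, and the names SIZES and growth are
made up for the example.

```python
# (token count, wordlist.db size in megabytes) per multi-token-count value,
# copied from the table above.
# Assumption: the 2.1 gigabyte figure is taken as 2100 megabytes.
SIZES = {
    1: (5_859_086, 158),
    2: (28_449_518, 873),
    3: (62_606_888, 2100),
}


def growth(mtc, base=1):
    """Return (token ratio, size ratio) of wordlist `mtc` relative to `base`."""
    tokens, megabytes = SIZES[mtc]
    base_tokens, base_megabytes = SIZES[base]
    return tokens / base_tokens, megabytes / base_megabytes


for mtc in sorted(SIZES):
    token_ratio, size_ratio = growth(mtc)
    print(f"mtc={mtc}: {token_ratio:.1f}x tokens, {size_ratio:.1f}x size")
```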

Having created the wordlists, I scored all the messages and tabulated
the results:

 mtc     hH      hU      hS      sH      sU      sS
        corr     uns     FP      FN     uns    corr
   1  235103     301      0     287    7907  453687
   2  235133     271      0     160    7079  454642
   3  235204     199      1      69    4221  457591

The headings indicate my manual classification ("h" or "s") and
bogofilter's classification ("H", "U", or "S").

Column 1 has the value of multi-token-count.
Column 2 shows ham scored as ham, i.e. correct.
Column 3 shows ham scored as unsure.
Column 4 shows ham scored as spam, i.e. false positives.
Column 5 shows spam scored as ham, i.e. false negatives.
Column 6 shows spam scored as unsure.
Column 7 shows spam scored as spam, i.e. correct.

As can be seen, most messages are correctly classified (columns hH and
sS).  Some are incorrectly classified (hS, i.e. false positives, and
sH, i.e. false negatives).  The remainder are classified as "unsure".

Also noticeable is that higher values of mtc produce more correct
results, fewer unsures, and fewer false negatives.  
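
The raw counts can be turned into percentages to make the comparison
easier.  The following Python sketch does only that arithmetic on the
numbers in the results table; the names RESULTS and rates are invented
for the example and are not part of bogofilter.

```python
# Counts copied from the results table: (hH, hU, hS, sH, sU, sS) per mtc value.
RESULTS = {
    1: (235103, 301, 0, 287, 7907, 453687),
    2: (235133, 271, 0, 160, 7079, 454642),
    3: (235204, 199, 1, 69, 4221, 457591),
}


def rates(row):
    """Convert one row of counts into percentage rates."""
    hH, hU, hS, sH, sU, sS = row
    ham = hH + hU + hS      # all messages manually classified as ham
    spam = sH + sU + sS     # all messages manually classified as spam
    total = ham + spam
    return {
        "fp_pct": 100.0 * hS / ham,            # ham scored as spam
        "fn_pct": 100.0 * sH / spam,           # spam scored as ham
        "unsure_pct": 100.0 * (hU + sU) / total,
        "correct_pct": 100.0 * (hH + sS) / total,
    }


for mtc in sorted(RESULTS):
    r = rates(RESULTS[mtc])
    print(f"mtc={mtc}  correct {r['correct_pct']:.3f}%  "
          f"unsure {r['unsure_pct']:.3f}%  "
          f"FN {r['fn_pct']:.4f}%  FP {r['fp_pct']:.4f}%")
```

Each row sums to the corpus size given earlier (235404 ham and 461881
spam), which is a quick check that the table was transcribed correctly.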

The one exception to "higher mtc is always more accurate" is the 1 FP for
mtc=3.  The message in question was sent on Nov 28, 2003 to the
SpamBayes mailing list.  Its subject is "[Spambayes] A SPAM Message
with a 0% Score".  As can be deduced from the subject, the message
contains the full content of a spam message.  As the message is
_about_ spam, I rated it as "ham".  Bogofilter with mtc=3 considered
the spam portion to be more important, hence classified it as spam.

The increase in multi-token-count value also significantly increases
bogofilter's memory footprint (when scoring large messages or
registering large numbers of messages) and significantly increases
bogofilter's processing time.  

Bogofilter's default behavior will remain unchanged to maintain speed
and conserve wordlist space.  The new capabilities have been
implemented for those who want them, because I thought the
implementation would be interesting, because I've wanted to implement
the features for some time, and because I've promised the features to
the bogofilter community.

I'll build the final files and release these new capabilities in
bogofilter 1.1.0 in the next day or so.

Regards,

David