[bogofilter] Are token pairs useful

Fri Apr 16 07:52:04 CEST 2004

Hi!

Using the patch David provided for token pairs
(http://article.gmane.org/gmane.mail.bogofilter.general/7580)
I tested how this feature compares to the single token
approach. In both cases I used my version of the lexer
(http://piology.org/bogofilter/). I used 10000 ham mails and
15000 spam mails for training and 4556 ham mails and 5500
spam mails for testing. With each version I ran both full
training and training to exhaustion (TTE).
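
To make clear what "token pairs" means here, a minimal
sketch in Python. It is not David's patch: the function name
token_pairs and the "*" separator are my own assumptions, and
whether the patch stores pairs in addition to or instead of
single tokens is not shown here.

```python
def token_pairs(tokens, sep="*"):
    """Join each token with its right-hand neighbour.

    Hypothetical helper for illustration only; the separator is
    an assumption, not taken from the bogofilter patch.
    """
    return [f"{left}{sep}{right}" for left, right in zip(tokens, tokens[1:])]


words = "buy cheap pills now".split()
print(words)               # the single tokens
print(token_pairs(words))  # ['buy*cheap', 'cheap*pills', 'pills*now']
```

A message with n tokens yields n - 1 pairs, and distinct
pairs are far more numerous than distinct words, which is
why the database size is the first thing to check.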

The first question is how much the size of the database will
change (databases compacted):

Training | Pairs |  ham  |  spam |  kb
---------+-------+-------+-------+------
TTE (4)  |   n   |   225 |   433 |  2212
---------+-------+-------+-------+------
TTE (4)  |   y   |   222 |   378 | 10216
---------+-------+-------+-------+------
Full     |   n   | 10000 | 15000 | 15664
---------+-------+-------+-------+------
Full     |   y   | 10000 | 15000 | 92788
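
The growth factors can be read off the table with a few
lines of Python. The numbers are copied from the kb column
above; the variable names are only for illustration.

```python
# Database sizes in kB from the table: (without pairs, with pairs)
sizes_kb = {"TTE": (2212, 10216), "Full": (15664, 92788)}

# Growth factor = size with pairs / size without pairs
growth = {
    name: with_pairs / without
    for name, (without, with_pairs) in sizes_kb.items()
}

for name, factor in growth.items():
    print(f"{name}: {factor:.1f}x larger with pairs")
```

This gives roughly 4.6x for TTE and 5.9x for full training.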

More important: Do we get better results?

Training | Pairs | fp / | fn /
         |       | 4556 | 5500
---------+-------+------+-------
TTE      |   n   |   16 |   30  
---------+-------+------+-------
TTE      |   y   |   21 |   29  
---------+-------+------+-------
Full     |   n   |   18 |  150  
---------+-------+------+-------
Full     |   y   |   24 |  106  
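
Since the ham and spam test sets differ in size, it may help
to look at rates instead of raw counts. A small sketch, using
only the counts from the table above (the helper name rates
is my own):

```python
HAM_TESTED, SPAM_TESTED = 4556, 5500

# (training, pairs) -> (false positives, false negatives)
results = {
    ("TTE", "n"): (16, 30),
    ("TTE", "y"): (21, 29),
    ("Full", "n"): (18, 150),
    ("Full", "y"): (24, 106),
}


def rates(fp, fn):
    """Return (fp rate, fn rate) in percent of the tested mails."""
    return 100 * fp / HAM_TESTED, 100 * fn / SPAM_TESTED


for (training, pairs), (fp, fn) in results.items():
    fp_pct, fn_pct = rates(fp, fn)
    print(f"{training:4} pairs={pairs}  fp {fp_pct:.2f}%  fn {fn_pct:.2f}%")
```

With pairs the fp rate goes from about 0.35% to 0.46% (TTE)
and from 0.40% to 0.53% (full); the fn rate with full
training drops from about 2.73% to 1.93%.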

Findings: Taking pairs of tokens is very expensive in
database size (about 4.6 times larger with TTE, 5.9 times
with full training). With full training the database becomes
unacceptably large.

We can observe a tendency for spam to be recognized better
(fewer spam messages used in TTE, fewer fn), but the effect
is too small to be worth the cost; if it pays off at all,
then only with full training.

Ham, on the other hand, does clearly worse: false positives
rise from 16 to 21 with TTE and from 18 to 24 with full
training.

Taking all this together, paired tokens do not seem useful.

It may well be that larger windows (say three to five
tokens) are useful, especially if skipping words is
allowed.
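
One possible reading of "larger windows with skipping",
sketched in Python. This is only a guess at what such a
tokenizer could look like and was not tested here; the name
windowed_pairs, the window parameter and the "*" separator
are all hypothetical.

```python
def windowed_pairs(tokens, window=3, sep="*"):
    """Pair each token with every later token inside the window.

    With window=2 this yields plain adjacent pairs; larger
    windows also pair tokens that have words between them.
    Hypothetical sketch, not part of bogofilter.
    """
    pairs = []
    for i, left in enumerate(tokens):
        # tokens[i + 1 : i + window] are the next (window - 1) tokens
        for right in tokens[i + 1 : i + window]:
            pairs.append(f"{left}{sep}{right}")
    return pairs


print(windowed_pairs("a b c d".split(), window=3))
# ['a*b', 'a*c', 'b*c', 'b*d', 'c*d']
```

Each token then produces up to window - 1 pairs, so the
database would presumably grow even more than with plain
pairs.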

pi