Importance of dot in TOKEN, revisited

Boris 'pi' Piwinger 3.14 at logic.univie.ac.at
Wed Mar 24 14:37:33 CET 2004


Hi!

Last week I ran two experiments with my lexer and found no
evidence that a) dot is a useful character inside a TOKEN or
that b) IP addresses (even with block_subnets) help. So I
decided to remove both (or rather, not to reinclude the
former) in my version of the lexer. Now I want to do a final
test comparing my (new) version of the lexer with one that
again allows dots inside a TOKEN. I don't include an explicit
rule for IP addresses, but they will be recognized as TOKENs
(like any other number sequence with dots in it). I no longer
expect this to help, but I want to be sure.
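
To illustrate the difference between the two variants, here is
a minimal Python sketch. It is not the actual flex lexer: the
function name tokenize, the allow_dot switch and the character
class (ASCII letters and digits only) are assumptions made for
this example, and the real TOKEN rule accepts more characters.

```python
import re

# Hypothetical simplification: a TOKEN is a run of ASCII letters/digits.
TOKEN_NO_DOT = re.compile(r"[A-Za-z0-9]+")
# Variant that also allows single dots *inside* a token (never leading/trailing).
TOKEN_WITH_DOT = re.compile(r"[A-Za-z0-9]+(?:\.[A-Za-z0-9]+)*")


def tokenize(text, allow_dot=False):
    """Return the list of tokens found in text under the chosen rule."""
    pattern = TOKEN_WITH_DOT if allow_dot else TOKEN_NO_DOT
    return pattern.findall(text)


sample = "Visit www.example.com from 192.168.0.1 now."
print(tokenize(sample))                  # dots split tokens
print(tokenize(sample, allow_dot=True))  # host name and IP stay whole
```

With allow_dot=True the host name and the dotted number sequence
each survive as one token, without any dedicated IP rule; without
it they are split into their components.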

Training is done on 10k messages each.

The first observation is the number of messages used when
training to exhaustion:
without dot: 331 spam / 206 ham
with dot:    347 spam / 182 ham
This suggests that dots are indeed useful for recognizing
ham (fewer ham messages are needed), but bad for recognizing
spam (more spam messages are needed). We'll see ...

More interesting, of course, is how well these trainings
classify messages. Here are the error counts on 2120 ham and
5544 spam messages:

           | size kb | fp |  fn
TTE        |  1844   |  3 | 113
TTE (dot)  |  1944   |  3 | 103
full       | 12652   |  2 | 227
full (dot) | 14200   |  2 | 251

So with dots the result is slightly better in one case (TTE)
and slightly worse in the other (full training). I conclude
that it really does not matter. I'll go for simplicity and
keep my lexer unchanged.
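
To put the counts in proportion, the following Python sketch
turns them into error rates. The counts and corpus sizes are
taken from the table above; the names RESULTS and rates are made
up for this example.

```python
HAM_TOTAL = 2120   # ham messages classified
SPAM_TOTAL = 5544  # spam messages classified

# (fp, fn) per configuration, copied from the table above.
RESULTS = {
    "TTE":        (3, 113),
    "TTE (dot)":  (3, 103),
    "full":       (2, 227),
    "full (dot)": (2, 251),
}


def rates(fp, fn):
    """Return (false-positive rate on ham, false-negative rate on spam)."""
    return fp / HAM_TOTAL, fn / SPAM_TOTAL


for name, (fp, fn) in RESULTS.items():
    fp_rate, fn_rate = rates(fp, fn)
    print(f"{name:<10} | fp {fp_rate:.2%} | fn {fn_rate:.2%}")
```

The false negative rate moves from about 2.04% to 1.86% for TTE
and from about 4.09% to 4.53% for full training, so the dot
shifts it by less than half a percentage point in either
direction, while the false positive counts are unchanged.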


The following might be interesting:
bogofilter is my lexer, bogodot the one with dots inside and
bogotest the standard lexer.

File sizes:
629k bogodot/bogofilter
625k bogofilter/bogofilter
642k bogotest/bogofilter
244k bogodot/bogolexer
240k bogofilter/bogolexer
257k bogotest/bogolexer

`size`
  text    data     bss     dec     hex filename
150269   24212   18176  192657   2f091 bogodot/bogofilter
147805   24212   18176  190193   2e6f1 bogofilter/bogofilter
164589   24212   18176  206977   32881 bogotest/bogofilter
 58967    2052   12868   73887   1209f bogodot/bogolexer
 56503    2052   12868   71423   116ff bogofilter/bogolexer
 73287    2052   12868   88207   1588f bogotest/bogolexer
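
As a quick cross-check of the `size` output, this Python sketch
verifies that text + data + bss equals dec in every row and
computes how much text each build adds over mine. The numbers
are copied from the listing above; the names SIZES,
totals_consistent and text_growth are made up for this example.

```python
# Rows copied from the `size` output above: (text, data, bss, dec).
SIZES = {
    "bogodot/bogofilter":    (150269, 24212, 18176, 192657),
    "bogofilter/bogofilter": (147805, 24212, 18176, 190193),
    "bogotest/bogofilter":   (164589, 24212, 18176, 206977),
    "bogodot/bogolexer":     (58967, 2052, 12868, 73887),
    "bogofilter/bogolexer":  (56503, 2052, 12868, 71423),
    "bogotest/bogolexer":    (73287, 2052, 12868, 88207),
}


def totals_consistent(sizes):
    """True if text + data + bss equals dec for every row."""
    return all(t + d + b == dec for t, d, b, dec in sizes.values())


def text_growth(sizes, build, binary, baseline="bogofilter"):
    """Text-segment bytes that build adds over baseline for one binary."""
    return sizes[f"{build}/{binary}"][0] - sizes[f"{baseline}/{binary}"][0]


print(totals_consistent(SIZES))
for build in ("bogodot", "bogotest"):
    for binary in ("bogofilter", "bogolexer"):
        print(build, binary, text_growth(SIZES, build, binary))
```

The data and bss columns are identical across builds, so the
whole difference sits in the text segment: 2464 bytes more for
bogodot and 16784 bytes more for bogotest, the same in both
binaries.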

pi



