Importance of dot in TOKEN

Boris 'pi' Piwinger 3.14 at logic.univie.ac.at
Fri Mar 19 13:14:56 CET 2004


Hi!

As I described to the list, I use a radically simplified
lexer with bogofilter (with great success BTW;-). This
essentially declares TOKEN to be:
[^[:blank:][:cntrl:]<>;&%@|/\\{}^"*,[\]=()+?:#$._!'`~-]+
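
For readers who want to play with this outside of flex, here is a rough Python sketch of the two variants compared below. It is only an approximation under stated assumptions: [:blank:] is taken as space and tab, [:cntrl:] as ASCII 0-31 plus 127, and the dot variant (TOKEN_DOT) is one plausible way to allow . only in the middle of a token, not necessarily the exact rule used in the test. The names WORD, TOKEN_PLAIN, TOKEN_DOT and tokens are made up for illustration.

```python
import re

# Characters excluded from TOKEN, transcribed from the bracket expression above.
# Assumption: [:blank:] is space and tab; [:cntrl:] is ASCII 0-31 and 127.
BLANK = " \t"
CNTRL = "".join(chr(c) for c in range(32)) + "\x7f"
PUNCT = "<>;&%@|/\\{}^\"*,[]=()+?:#$._!'`~-"

# re.escape makes every excluded character literal inside the class.
WORD = "[^" + re.escape(BLANK + CNTRL + PUNCT) + "]+"

# Simplified lexer: a dot always ends a token.
TOKEN_PLAIN = re.compile(WORD)

# Hypothetical dot variant: words joined by single dots, so a dot can
# never be the first or last character of a token.
TOKEN_DOT = re.compile(WORD + r"(?:\." + WORD + ")*")


def tokens(pattern, text):
    """Return all tokens that the given pattern finds in text."""
    return pattern.findall(text)


if __name__ == "__main__":
    sample = "see mail.example.com now."
    print(tokens(TOKEN_PLAIN, sample))
    print(tokens(TOKEN_DOT, sample))
```

With the plain pattern a host name falls apart into its components; with the dot variant it survives as one token, while a trailing full stop is still dropped.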

Every once in a while I have noticed that a message would
probably be rated quite differently if host names were
recognized as tokens. They are not, because . is not allowed
in TOKEN. So I asked myself whether this is a useful feature
or not. For this test I used my radical lexer and a copy
which allows . in the middle of a TOKEN (not as first or last
character!). I built four databases, each drawing on 10k ham
and 10k spam messages for training: training to exhaustion
and full training, each with both versions of the lexer.

The first observation is the number of messages actually
used by training to exhaustion:
without dot: 342 spam / 185 ham
with dot:    351 spam / 185 ham
So it does not seem to make much of a difference here.

More interesting, of course, is how well the resulting
databases classify messages. Here are the misclassification
counts for a test set of 1544 ham and 4417 spam messages:

           | size kb | fp |  fn
TTE        |  1912   |  2 |  82
TTE (dot)  |  1892   |  1 |  98
full       | 13612   |  2 | 182
full (dot) | 14204   |  2 | 193

Since there are only pretty few test messages, there is
little to observe. There is a very, very small indication
that . might help in avoiding fp's. The number of fn's seems
to be reduced a bit by *not* using dots. This is a surprise:
I expected the dot version to clearly outperform the much
simpler lexer. It does not. So I am going to keep it out.
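
To put the counts into proportion, here is a small Python sketch that turns the table into error rates. The numbers are copied from the table and the test set sizes given above; the names RESULTS and rates are just for illustration.

```python
N_HAM, N_SPAM = 1544, 4417  # test set sizes from the post

# (false positives, false negatives) per database, copied from the table
RESULTS = {
    "TTE": (2, 82),
    "TTE (dot)": (1, 98),
    "full": (2, 182),
    "full (dot)": (2, 193),
}


def rates(fp, fn):
    """Return (fp rate, fn rate) as percentages of the ham and spam tested."""
    return 100.0 * fp / N_HAM, 100.0 * fn / N_SPAM


if __name__ == "__main__":
    for name, (fp, fn) in RESULTS.items():
        fp_pct, fn_pct = rates(fp, fn)
        print(f"{name:<10} | fp {fp_pct:.2f}% | fn {fn_pct:.2f}%")
```

For training to exhaustion the fn rate comes out at roughly 1.9% without dots versus 2.2% with dots, so the gap is well under half a percentage point on this test set.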

With this result in mind it will be interesting to see if IP
numbers are really useful. I'll keep you posted.

pi



