Importance of dot in TOKEN

Fri Mar 19 13:14:56 CET 2004

Hi!

As I described to the list, I use a radically simplified
lexer with bogofilter (with great success BTW;-). This
essentially declares TOKEN to be:
[^[:blank:][:cntrl:]<>;&%@|/\\{}^"*,[\]=()+?:#$._!'`~-]+

Once in a while I have observed that a message would
probably be rated quite differently if host names were
recognized as tokens, which fails because . is not allowed
in TOKEN. So the question is whether this is a useful
feature or not. For this test I used my radical lexer and a
copy which allows . in the middle of a TOKEN (not as first
or last character!). I built four databases, each trained
from 10k ham and 10k spam: training to exhaustion and full
training, once with each version of the lexer.
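
To illustrate the difference between the two lexers, here is
a rough Python sketch. It is not the flex code I actually
use: Python's re module has no POSIX classes, so [:blank:]
and [:cntrl:] are approximated by ASCII ranges, and the
exact form of the dot rule (single dots joining runs of
ordinary TOKEN characters) is an assumption. The names
TOKEN_NO_DOT and TOKEN_DOT are made up for this example.

```python
import re

# Punctuation delimiters copied from the TOKEN definition above.
PUNCT = "<>;&%@|/\\{}^\"*,[]=()+?:#$._!'`~-"
# Approximation: [:blank:] as space/tab, [:cntrl:] as ASCII control chars.
BLANK_CNTRL = r" \t\x00-\x1f\x7f"

# One character that may appear in a TOKEN (dot is a delimiter here).
PLAIN = "[^" + BLANK_CNTRL + re.escape(PUNCT) + "]"

# Lexer without dot: a token is a run of plain characters.
TOKEN_NO_DOT = re.compile(PLAIN + "+")

# Lexer with dot (assumed form): runs of plain characters joined by
# single dots, so a dot is never the first or last character.
TOKEN_DOT = re.compile(PLAIN + r"+(?:\." + PLAIN + "+)*")

if __name__ == "__main__":
    line = "Received from mail.example.com today."
    print(TOKEN_NO_DOT.findall(line))
    print(TOKEN_DOT.findall(line))
```

With the dot variant the host name stays one token, while
the trailing dot of the sentence is still dropped.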

The first observation is the number of messages actually
used by training to exhaustion:
without dot: 342 spam / 185 ham
with dot:    351 spam / 185 ham
So it does not seem to make much of a difference here.

More interesting, of course, is how well the resulting
databases classify messages. Here are the false positive
(fp) and false negative (fn) counts for 1544 ham and 4417
spam test messages:

           | size kb | fp |  fn
TTE        |  1912   |  2 |  82
TTE (dot)  |  1892   |  1 |  98
full       | 13612   |  2 | 182
full (dot) | 14204   |  2 | 193

With so few test messages there is not much to observe.
There is a very, very small indication that . might help in
avoiding fp's (1 instead of 2 with TTE). The number of fn's
is somewhat lower when *not* using dots (82 vs. 98 with TTE,
182 vs. 193 with full training). This is a surprise: I
expected the dot version to clearly outperform the much
simpler lexer. It does not, so I am going to keep the dot
out.
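
To put the counts from the table above in proportion, a
small sketch that turns them into error rates. The counts
and corpus sizes are the ones given in this mail; the
variable and function names are only for illustration.

```python
# Test corpus sizes from this mail.
HAM_TOTAL = 1544
SPAM_TOTAL = 4417

# (false positives, false negatives) per database, from the table.
RESULTS = {
    "TTE": (2, 82),
    "TTE (dot)": (1, 98),
    "full": (2, 182),
    "full (dot)": (2, 193),
}

def error_rates(fp, fn):
    """Return (fp rate, fn rate) in percent of ham and spam respectively."""
    return 100.0 * fp / HAM_TOTAL, 100.0 * fn / SPAM_TOTAL

if __name__ == "__main__":
    for name, (fp, fn) in RESULTS.items():
        fp_pct, fn_pct = error_rates(fp, fn)
        print(f"{name:<10} | fp {fp_pct:.2f}% | fn {fn_pct:.2f}%")
```

The fp differences come down to a single message, so they
say very little. The fn rates differ by well under one
percentage point in both training modes.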

With this result in mind it will be interesting to see if IP
numbers are really useful. I'll keep you posted.

pi