Importance of IP addresses in lexer

Fri Mar 19 19:20:47 CET 2004

Hi!

After I came to the conclusion that dots in the middle of
TOKEN are of no importance (and hence neither is the
recognition of host names!), I asked myself whether IP
addresses are.

For my test I use my radically simplified lexer with
bogofilter. This essentially declares TOKEN to be:
[^[:blank:][:cntrl:]<>;&%@|/\\{}^"*,[\]=()+?:#$._!'`~-]+
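
As a rough illustration only (this is not the flex source of the
lexer), the character class above can be approximated in Python.
The sketch assumes that [:blank:] means space and tab and that
[:cntrl:] means the ASCII control characters; the names BLANK,
CNTRL, PUNCT and tokenize are made up for this example.

```python
import re

# Characters excluded from TOKEN, transcribed from the class above.
# Assumption: [:blank:] = space and tab, [:cntrl:] = ASCII 0x00-0x1f and 0x7f.
BLANK = " \t"
CNTRL = "".join(chr(c) for c in range(0x20)) + "\x7f"
PUNCT = "<>;&%@|/\\{}^\"*,[]=()+?:#$._!'`~-"

# re.escape keeps ], ^, - and \ literal inside the bracket expression.
TOKEN = re.compile("[^" + re.escape(BLANK + CNTRL + PUNCT) + "]+")

def tokenize(text):
    """Return all maximal runs of characters that are not excluded."""
    return TOKEN.findall(text)

print(tokenize("192.168.0.1"))       # ['192', '168', '0', '1']
print(tokenize("mail.example.com"))  # ['mail', 'example', 'com']
```

Because the dot is in the excluded set, a dotted quad that no
other rule catches falls apart into its four numbers, and a host
name into its labels.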

It still has the IP address rule of the standard lexer. I
tested three different settings, each with both training to
exhaustion and full training, using 10k ham and 10k spam
messages for training. First, my lexer as it is. Second, a
modified lexer with the IP address rule removed (note that
numbers are still allowed). Third, my lexer with the IP
address rule plus the --block_on_subnets=yes option.

The first observation is the number of messages used for
training to exhaustion:
just IP:         342 spam / 185 ham
no IP:           336 spam / 187 ham
IP plus subnets: 317 spam / 178 ham

So the last method needs the fewest messages. This suggests
it is the most efficient. Let's see if this is still true
when meeting new messages ...

BTW: The third is about 4% slower than the second.

Here are the error counts (false positives and false
negatives) on 1582 ham and 4458 spam messages:

            | size (kB) | fp |  fn
TTE         |      1916 |  2 |  83
TTE (noip)  |      1844 |  2 |  71
TTE (subn)  |      1892 |  2 |  76
full        |     13616 |  2 | 184
full (noip) |     12652 |  2 | 182
full (subn) |     14236 |  3 | 185
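
To put these counts in proportion, here is a small sketch that
turns them into rates, using the 1582 ham and 4458 spam messages
as denominators. The names RESULTS, fp_rate and fn_rate are just
for this example.

```python
HAM, SPAM = 1582, 4458  # number of test messages, from the text above

# (false positives, false negatives) per setting, copied from the table.
RESULTS = {
    "TTE":         (2, 83),
    "TTE (noip)":  (2, 71),
    "TTE (subn)":  (2, 76),
    "full":        (2, 184),
    "full (noip)": (2, 182),
    "full (subn)": (3, 185),
}

def fp_rate(name):
    """Fraction of ham messages wrongly classified as spam."""
    return RESULTS[name][0] / HAM

def fn_rate(name):
    """Fraction of spam messages that were not caught."""
    return RESULTS[name][1] / SPAM

for name in RESULTS:
    print(f"{name:12s} fp {fp_rate(name):.2%}  fn {fn_rate(name):.2%}")

# Setting with the fewest false negatives.
print(min(RESULTS, key=lambda n: RESULTS[n][1]))  # TTE (noip)
```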

Now this is a surprise. The difference is not large, but the
lexer without the IP address rule had the fewest false
negatives in both training modes. Subnet blocking was not
helpful; with full training it even gave one more false
positive. Of course, I did not have a lot of messages to
test, but this is what I got. So I will work with an even
simpler lexer in the future. Remember, I still get the
single bytes of the IP addresses as tokens with my lexer.
You'll find it (give me a few minutes ;-) on
http://piology.org/bogofilter/.

pi