Algorithm limitations.

Sun Apr 11 08:25:35 CEST 2004

(I think this got lost in the mailing list changeover).

There are a couple of things that bogofilter cannot
learn:

1. The absence of a feature.
2. The XOR problem.

At the moment, there's no way for bogofilter to learn that
the absence of a word is a ham/spam indicator.

Similarly, it's not possible for bogofilter to learn
that (say) "A A" and "B B" are both ham, but "A B" and "B A"
are both spam.
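
To make the XOR limitation concrete, here is a minimal sketch in Python.
It is not bogofilter's code. It assumes a simplified scorer that keeps one
spam/ham ratio per token, and it uses the textbook presence-based form of
XOR (ham when both or neither of A and B appear, spam when exactly one
appears) rather than the "A A"/"B B" example above. The names train and
token_prob are made up for illustration.

```python
from collections import Counter

def train(messages):
    """Count, per token, how many spam and ham messages contain it."""
    spam_counts, ham_counts = Counter(), Counter()
    n_spam = n_ham = 0
    for tokens, is_spam in messages:
        if is_spam:
            n_spam += 1
            spam_counts.update(set(tokens))
        else:
            n_ham += 1
            ham_counts.update(set(tokens))
    return spam_counts, ham_counts, n_spam, n_ham

def token_prob(token, model):
    """Simplified per-token spam probability (an assumption, not
    bogofilter's exact Graham or Robinson/Fisher calculation)."""
    spam_counts, ham_counts, n_spam, n_ham = model
    s = spam_counts[token] / n_spam
    h = ham_counts[token] / n_ham
    return s / (s + h) if s + h else 0.5

# XOR corpus: ham if both or neither token is present, spam otherwise.
corpus = [
    (["A", "B"], False),
    ([], False),
    (["A"], True),
    (["B"], True),
]
model = train(corpus)
probs = {t: token_prob(t, model) for t in ("A", "B")}
print(probs)  # {'A': 0.5, 'B': 0.5}
```

Each token is seen equally often in spam and in ham, so every per-token
probability is exactly neutral and no way of combining them can separate
the two classes.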

I wanted to get an idea of how important #1 was, so I inverted
my database. That is, I set every count to ( total messages - count )
and ran all the tokens through the new database to get the new probabilities.
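
The inversion can be sketched roughly as follows. This is an illustration
only: the function names, the example token and its counts are
hypothetical, and the probability is a plain ratio of frequencies rather
than bogofilter's actual Graham or Robinson/Fisher calculation. The
message totals are the approximate figures for this database.

```python
def invert_counts(counts, n_spam, n_ham):
    """Turn 'messages containing the token' into 'messages lacking it'."""
    return {
        token: (n_spam - spam, n_ham - ham)
        for token, (spam, ham) in counts.items()
    }

def absence_prob(spam_absent, ham_absent, n_spam, n_ham):
    """Simplified spam probability given that the token is absent."""
    s = spam_absent / n_spam
    h = ham_absent / n_ham
    return s / (s + h) if s + h else 0.5

N_SPAM, N_HAM = 5000, 2500  # approximate message totals

# Hypothetical token seen in 4000 spams and 1000 hams.
counts = {"example": (4000, 1000)}

inverted = invert_counts(counts, N_SPAM, N_HAM)
spam_absent, ham_absent = inverted["example"]
p = absence_prob(spam_absent, ham_absent, N_SPAM, N_HAM)
print(inverted["example"], round(p, 2))
```

A value well below 0.5 means the absence of the token points towards ham;
a value well above 0.5 means its absence points towards spam.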

Most significant tokens were:
                                 spam    good  Gra prob  Rob/Fish
http                              656     737  0.286693  0.289406
href                             1691    1346  0.361954  0.362510
rcvd:with                           2       4  0.184189  0.373429
rcvd:Received                       0       2  0.400000  0.380872
rcvd:from                           0       2  0.400000  0.380872
rtrn:Return-Path                    0       2  0.400000  0.380872
rcvd:SMTP                        1577    1159  0.380575  0.380981
head:Date                           3       4  0.252985  0.383652
[...]
for                              2034     643  0.588203  0.585762
and                              1672     515  0.594484  0.591384
this                             2611     802  0.595152  0.593145
head:charset                     2113     584  0.620316  0.617379
rcvd:ESMTP                       2503     632  0.641362  0.638531
the                              1529     366  0.653546  0.648610
rcvd:for                         1297     227  0.720669  0.712499

Not too many surprises. The absence of 'http' in the body is a fairly
strong ham indicator; likewise 'href'.

Overall, not too many strong features. (Total token count in this
database was about 2 million, with about 5000 spams and 2500 hams).

Now, how can I test how important the XOR problem is? :)

Michael.