Algorithm limitations.
michael at optusnet.com.au
Sun Apr 11 08:25:35 CEST 2004
(I think this got lost in the mailing list changeover).
There are a couple of things that bogofilter cannot
learn:
1. The absence of a feature.
2. The XOR problem.
At the moment, there's no way for bogofilter to learn that
the absence of a word is a ham/spam indicator.
Similarly, it's not possible for bogofilter to learn
that (say) "A A" and "B B" are both ham, but "A B" or "B A"
are both spam.
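
A minimal sketch of why the example above is hard for a per-token
classifier. This is a toy naive-Bayes-style combiner, not bogofilter's
actual calculation; the function names are hypothetical, and it assumes
every occurrence of a token is counted.

```python
from collections import Counter
from math import prod

# Toy corpus from the example above: (message, is_spam).
corpus = [("A A", False), ("B B", False), ("A B", True), ("B A", True)]

spam_counts, ham_counts = Counter(), Counter()
for text, is_spam in corpus:
    # Assumption: every occurrence of a token is counted.
    (spam_counts if is_spam else ham_counts).update(text.split())

def token_prob(token):
    """Fraction of this token's occurrences that came from spam."""
    s, h = spam_counts[token], ham_counts[token]
    return s / (s + h)

def score(text):
    """Combine per-token probabilities as if tokens were independent."""
    probs = [token_prob(t) for t in text.split()]
    spam = prod(probs)
    ham = prod(1 - p for p in probs)
    return spam / (spam + ham)

scores = {text: score(text) for text, _ in corpus}
for text, value in scores.items():
    print(text, value)
```

Under these assumptions "A" and "B" each occur twice in ham and twice in
spam, so each token looks neutral on its own and all four messages get
the same score. The information is only in the combination of tokens.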
I wanted to get an idea of how important #1 was, so I inverted
my database. That is, I set all the counts to ( total messages - count )
and ran all the tokens through that new database to get the new probabilities.
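
The inversion described above could be sketched roughly as follows. The
function names and the counts are made up for illustration, the totals
are assumed to be kept per class (spam and good), and the probability is
a plain relative-frequency ratio rather than either of bogofilter's own
formulas.

```python
def invert_counts(counts, total_spam, total_good):
    """Map 'messages containing token' counts to 'messages lacking token'."""
    return {
        token: (total_spam - spam, total_good - good)
        for token, (spam, good) in counts.items()
    }

def absence_spamicity(spam_absent, good_absent, total_spam, total_good):
    """Simple ratio estimate of how spammy the absence of a token is."""
    spam_rate = spam_absent / total_spam
    good_rate = good_absent / total_good
    return spam_rate / (spam_rate + good_rate)

# Hypothetical counts: token -> (spam messages, good messages) containing it.
counts = {"http": (90, 20), "the": (60, 80)}
total_spam, total_good = 100, 100

inverted = invert_counts(counts, total_spam, total_good)
for token, (spam_absent, good_absent) in inverted.items():
    p = absence_spamicity(spam_absent, good_absent, total_spam, total_good)
    print(f"{token:<6} {spam_absent:>4} {good_absent:>4} {p:.6f}")
```

A value well below 0.5 means that the absence of the token points
towards ham; a value well above 0.5 means it points towards spam.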
Most significant tokens were:
token               spam   good   Gra prob   Rob/Fish
http                 656    737   0.286693   0.289406
href                1691   1346   0.361954   0.362510
rcvd:with              2      4   0.184189   0.373429
rcvd:Received          0      2   0.400000   0.380872
rcvd:from              0      2   0.400000   0.380872
rtrn:Return-Path       0      2   0.400000   0.380872
rcvd:SMTP           1577   1159   0.380575   0.380981
head:Date              3      4   0.252985   0.383652
[...]
for                 2034    643   0.588203   0.585762
and                 1672    515   0.594484   0.591384
this                2611    802   0.595152   0.593145
head:charset        2113    584   0.620316   0.617379
rcvd:ESMTP          2503    632   0.641362   0.638531
the                 1529    366   0.653546   0.648610
rcvd:for            1297    227   0.720669   0.712499
Not too many surprises. The absence of 'http' in the body is a fairly
strong ham indicator; likewise 'href'.
Overall, not too many strong features. (The total token count in this
database was about 2 million, with about 5000 spams and 2500 hams.)
Now, how can I test how important the XOR problem is? :)
Michael.