Algorithm limitations.

michael at optusnet.com.au
Wed Apr 14 01:57:59 CEST 2004


Tom Anderson <tanderso at oac-design.com> writes:
> On Sun, 2004-04-11 at 02:25, michael at optusnet.com.au wrote:
> > At the moment, there's no way for bogofilter to learn that
> > the absence of a word is a ham/spam indicator.
> 
> Doesn't it already do this by ranking the hamminess of the tokens which
> DO appear?  The lack of strong ham indicators is certainly represented
[..]
> > Not too many surprises. The absence of 'http' in the body is a fairly
> > strong ham indicator. Likewise 'href'.
> 
> Precisely, so if 'http' and 'href' are missing, they will not contribute
> their spaminess to the final score.  Thus the message will be hammier,
> and your goal is achieved.

Not really. Consider, say, a message consisting of the single token
"xyzzy". That token isn't in the database, so it gets robx as its
value, and the spaminess of the message comes out at (say) 0.5
(i.e. no idea whether it's ham or spam).

However, the message is missing both 'href' and 'http', and nothing
in the score reflects that. Counting those absences as evidence, the
spaminess should really be ~0.1 (i.e. fairly sure it's not spam).
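
To make the example concrete, here is a rough sketch in Python. It is not bogofilter's code: the combining step is a plain sum of log-odds standing in for the real calculation, and every number in it (ROBX, the PRESENT and ABSENT tables) is made up for illustration. The function names are hypothetical too.

```python
import math

# Assumed value for tokens not in the database (the "say 0.5" above).
ROBX = 0.5

# Made-up spam probabilities for tokens that ARE in the message.
PRESENT = {"http": 0.90, "href": 0.95}

# Made-up spam probabilities for tracked tokens that are NOT in the message.
ABSENT = {"http": 0.25, "href": 0.25}


def combine(probs):
    """Combine per-feature spam probabilities by summing their log-odds.

    A simplified stand-in, not the combining method bogofilter uses.
    """
    logit = sum(math.log(p / (1.0 - p)) for p in probs)
    return 1.0 / (1.0 + math.exp(-logit))


def score_present_only(tokens):
    """Score using only tokens that appear; unknown tokens get ROBX."""
    return combine(PRESENT.get(t, ROBX) for t in set(tokens))


def score_with_absence(tokens):
    """Same as above, but each tracked token that is missing also counts."""
    present = set(tokens)
    probs = [PRESENT.get(t, ROBX) for t in present]
    probs += [p for t, p in ABSENT.items() if t not in present]
    return combine(probs)


if __name__ == "__main__":
    msg = ["xyzzy"]
    print(score_present_only(msg))   # the unknown token alone: no opinion
    print(score_with_absence(msg))   # missing 'http' and 'href' pull it toward ham
```

With these invented numbers the first score stays at 0.5 and the second drops to roughly 0.1, which is the gap I'm describing: the missing tokens carry information that the present-only score never sees.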



Michael.



