Algorithm limitations.
michael at optusnet.com.au
Wed Apr 14 01:57:59 CEST 2004
Tom Anderson <tanderso at oac-design.com> writes:
> On Sun, 2004-04-11 at 02:25, michael at optusnet.com.au wrote:
> > At the moment, there's no way for bogofilter to learn that
> > the absence of a word is a ham/spam indicator.
>
> Doesn't it already do this by ranking the hamminess of the tokens which
> DO appear? The lack of strong ham indicators is certainly represented
[..]
> > Not too many surprises. The absence of 'http' in the body is a fairly
> > strong ham indicator. Likewise 'href'.
>
> Precisely, so if 'http' and 'href' are missing, they will not contribute
> their spaminess to the final score. Thus the message will be hammier,
> and your goal is achieved.
Not really. Consider, say, the message "xyzzy". That token isn't
in the database, so it gets robx as its value. The spaminess would
then be (say) 0.5, i.e. no idea whether it's ham or spam.
However, the message is missing both 'href' and 'http'. Nothing in the
current scoring counts that absence as evidence, so the score stays at
0.5 when it should really be ~0.1, i.e. fairly sure it's not spam.
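
To make the difference concrete, here is a rough sketch of the two
scoring approaches. It is not bogofilter's actual algorithm: the
log-odds combination, the function names and every probability in the
two tables are assumptions chosen purely for illustration, with robx
taken as the "(say) 0.5" used above.

```python
import math

ROBX = 0.5  # assumed value for tokens not in the database

def combine(probs):
    """Combine per-feature spam probabilities by summing log-odds.

    This is a simple naive-Bayes-style combination, used here only
    as a stand-in for the real scoring function.
    """
    logit = sum(math.log(p / (1.0 - p)) for p in probs)
    return 1.0 / (1.0 + math.exp(-logit))

def score_present_only(tokens, present_db):
    """Score using only the tokens that appear in the message."""
    return combine([present_db.get(t, ROBX) for t in tokens])

def score_with_absence(tokens, present_db, absent_db):
    """Also count tracked words that are missing from the message."""
    probs = [present_db.get(t, ROBX) for t in tokens]
    seen = set(tokens)
    probs += [p for word, p in absent_db.items() if word not in seen]
    return combine(probs)

# Hypothetical tables: P(spam | word present) and P(spam | word absent).
present_db = {"http": 0.8, "href": 0.9}
absent_db = {"http": 0.25, "href": 0.25}

message = ["xyzzy"]
print(score_present_only(message, present_db))             # 0.5
print(score_with_absence(message, present_db, absent_db))  # about 0.1
```

With these made-up numbers the present-only score stays at 0.5, while
counting the two missing words pulls it down to roughly 0.1.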
Michael.