Algorithm limitations.

Tue Apr 13 06:18:26 CEST 2004

On Sun, 2004-04-11 at 02:25, michael at optusnet.com.au wrote:
> At the moment, there's no way for bogofilter to learn that
> the absence of a word is a ham/spam indicator.

Doesn't it already do this by ranking the hamminess of the tokens which
DO appear?  The lack of strong ham indicators is certainly represented
in the final calculation.  If you were to somehow do an inverse
calculation where each missing ham token counted toward the spamminess
in proportion to its hamminess, and likewise each missing spam token
counted toward the hamminess, I'd wager you'd end up with nearly the
same spamicity you get with the current algorithm.

If you want to achieve something like noting whether the To: field is
missing, then I'd recommend a heuristic analysis à la SpamAssassin.

> Not too many surprises. The absence of 'http' in the body is a fairly
> strong ham indicator. Likewise 'href'.

Precisely, so if 'http' and 'href' are missing, they will not contribute
their spamminess to the final score.  Thus the message will be hammier,
and your goal is achieved.
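
To make that concrete, here is a rough Python sketch of a Robinson-Fisher
style combination of per-token spam probabilities.  It is a simplification
for illustration only, not bogofilter's actual code: the function names
and the per-token probabilities below are hypothetical, and there is no
token selection or minimum-deviation cutoff.  It only shows that leaving
out two spammy tokens lowers the combined score.

```python
import math

def chi2q(x, dof):
    """Upper-tail probability of a chi-square value x; dof must be even."""
    m = x / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def spamicity(probs):
    """Combine per-token spam probabilities (each strictly between 0 and 1)."""
    n = len(probs)
    if n == 0:
        return 0.5  # no evidence either way
    # Close to 1 when the tokens present are mostly spammy.
    spam_side = chi2q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    # Close to 1 when the tokens present are mostly hammy.
    ham_side = chi2q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    return (1.0 + spam_side - ham_side) / 2.0

# Hypothetical per-token spam probabilities, made up for this example.
hammy_tokens = {"meeting": 0.1, "report": 0.2}
spammy_tokens = {"http": 0.9, "href": 0.9}

# Same message body, with and without the two spammy tokens present.
score_without = spamicity(list(hammy_tokens.values()))
score_with = spamicity(list(hammy_tokens.values()) + list(spammy_tokens.values()))

print("without http/href:", score_without)
print("with http/href:   ", score_with)
```

The message lacking 'http' and 'href' scores lower (hammier) than the
same message with them, even though nothing in the calculation looks at
absent tokens explicitly.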

Tom