[bogofilter] ESF and redundancy

Tom Anderson tanderso at oac-design.com
Wed May 12 13:39:25 CEST 2004


On Tue, 2004-05-11 at 21:48, michael at optusnet.com.au wrote:
> Indeed.  I quite like the way CRM114 does things (and indeed
> I posted a patch some time ago that implemented the 'lossy database'
> idea).

Let's discuss it for inclusion in a future version of bogofilter,
then.
 
> This bit of CRM114 I'm not sure about. With word pairs, the number of
> heavily imbalanced scores (i.e. always ham or always spam) is quite
> high. The longer the phrase is, the more likely that it's always ham
> or spam. So that tends to automatically place more weight on 
> longer phrases.

Take a phrase like "your document is attached", and assume that each
individual word is significantly hammy.  For the combined phrase to
outweigh the hammy scores of its individual tokens, it needs a much
higher weight than those tokens put together.  Otherwise the phrase only
cancels them out, producing a neutral (or merely less hammy) score
rather than a spam score.  That is not the desired behavior.  If we see
a four-word phrase that we know is spammy, we want it to contribute
significantly to the final score.
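
To put rough numbers on that, here is a small sketch.  It is not
bogofilter's or CRM114's actual calculation: it combines features with a
plain weighted log-odds sum, and the probabilities (0.2 per word, 0.99
for the phrase), the weights, and the name combined_score are all made
up for illustration.

```python
import math

def combined_score(features):
    """Combine (spam_probability, weight) pairs into one spam score.

    Hypothetical helper: sums weighted log-odds and maps the total
    back to a probability in (0, 1).
    """
    total = sum(w * math.log(p / (1.0 - p)) for p, w in features)
    return 1.0 / (1.0 + math.exp(-total))

# Four individually hammy words (assumed spam probability 0.2 each).
words = [(0.2, 1.0)] * 4
# The four-word phrase itself is assumed to be strongly spammy.
phrase_p = 0.99

words_only = combined_score(words)
equal = combined_score(words + [(phrase_p, 1.0)])    # phrase counts like one word
boosted = combined_score(words + [(phrase_p, 4.0)])  # phrase weighted more heavily

print(f"words only:           {words_only:.4f}")
print(f"phrase, equal weight: {equal:.4f}")
print(f"phrase, weight 4:     {boosted:.4f}")
```

With equal weight the phrase only drags the score up from very hammy to
somewhat hammy; it still lands below 0.5.  Only when the phrase is
weighted more heavily than its component words does the message come out
spammy.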

Tom




