[bogofilter] ESF and redundancy
Tom Anderson
tanderso at oac-design.com
Wed May 12 13:39:25 CEST 2004
On Tue, 2004-05-11 at 21:48, michael at optusnet.com.au wrote:
> Indeed. I quite like the way CRM114 does things (and indeed
> I posted a patch some time ago that implemented the 'lossy database'
> idea).
Let's discuss this for inclusion in future versions of bogofilter,
then.
> This bit of CRM114 I'm not sure about. With word pairs, the number of
> heavily imbalanced scores (i.e. always ham or always spam) is quite
> high. The longer the phrase is, the more likely that it's always ham
> or spam. So that tends to automatically place more weight on
> longer phrases.
Take a phrase like "your document is attached", and assume that each
individual word is significantly hammy. For the combined phrase to
outweigh the hammy scores of the individual tokens, it must carry much
more weight than those tokens combined. Otherwise the phrase only
cancels them out, producing a neutral (or merely less hammy) score
rather than a spam score. That is not the desired behavior. If we see
a four-word phrase that we know is spammy, then we want it to
contribute significantly to the final score.
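
To put rough numbers on that, here is a minimal Python sketch. It
assumes a simple weighted log-odds combination, which is NOT how
bogofilter actually computes its score, and the probabilities (0.2 for
each word, 0.99 for the phrase) and the weights are made up purely for
illustration. The names combined_score, words and phrase_p are
hypothetical.

```python
import math


def logit(p):
    """Log-odds of a spam probability p, with 0 < p < 1."""
    return math.log(p / (1.0 - p))


def combined_score(tokens):
    """Combine (spam_probability, weight) pairs into one score in (0, 1).

    Assumption: a weighted sum of log-odds squashed back to a
    probability. This is an illustrative stand-in, not bogofilter's
    real algorithm.
    """
    total = sum(weight * logit(p) for p, weight in tokens)
    return 1.0 / (1.0 + math.exp(-total))


# Hypothetical values: four hammy words and one very spammy phrase.
words = [(0.2, 1.0)] * 4
phrase_p = 0.99

words_only = combined_score(words)
# Phrase weighted like a single word: it only partly cancels the words.
equal_weight = combined_score(words + [(phrase_p, 1.0)])
# Phrase weighted like four words: it dominates the result.
heavy_weight = combined_score(words + [(phrase_p, 4.0)])

print("words only:    %.4f" % words_only)
print("equal weight:  %.4f" % equal_weight)
print("heavy weight:  %.4f" % heavy_weight)
```

With these made-up numbers the equally weighted phrase pulls the score
up but leaves it on the ham side of 0.5, while the heavily weighted
phrase pushes it well onto the spam side.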
Tom