[bogofilter] ESF and redundancy

michael at optusnet.com.au michael at optusnet.com.au
Thu May 13 04:30:48 CEST 2004


Tom Anderson <tanderso at oac-design.com> writes:
> On Tue, 2004-05-11 at 21:48, michael at optusnet.com.au wrote:
[...]
> > This bit of CRM114 I'm not sure about. With word pairs, the number of
> > heavily imbalanced scores (i.e. always ham or always spam) is quite
> > high. The longer the phrase is, the more likely that it's always ham
> > or spam. So that tends to automatically place more weight on 
> > longer phrases.
> 
> Given a phrase like "your document is attached", assume that each
> individual word is significantly hammy.  In order for a combined phrase
> to then outweigh the hammy score of the individual tokens, it must have
> a much higher weight than the sum of the individual tokens.  Otherwise
> you're only cancelling out the effect, producing a neutral (or even just
> less hammy) score rather than a spam score.  This is not the desired
> behavior.  If we see a four-word phrase that we know is spammy, then we
> want it to contribute significantly to the final score.

My point here is that the word 'document' will likely wind up
with a mildly hammy rating of (say) 0.3, as a result of appearing
in both ham and spam. But the longer phrase is likely to appear
in spam only, and so will wind up at 0.98 (say). The normal algorithm
will place more weight on the longer phrase automagically (as it's
a more extreme value).
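
To put rough numbers on that, here is a minimal sketch. It assumes the
scores are spamicities in (0, 1) with 0.5 neutral, and that they get
combined in the log domain (as Fisher-style or naive-Bayes combining
effectively does), so the pull of a single score is its log-odds. The
function name `evidence` is made up for illustration, the 0.3 and 0.98
figures are just the "say" values from above, and none of this is
bogofilter's actual code.

```python
import math

def evidence(score):
    """Log-odds of a token score in (0, 1).

    The sign gives the direction (negative = hammy, positive = spammy)
    and the magnitude gives how hard the token pulls on the total.
    Hypothetical helper, for illustration only.
    """
    return math.log(score / (1.0 - score))

word = evidence(0.3)     # e.g. 'document' on its own, mildly hammy
phrase = evidence(0.98)  # the full four-word phrase, seen in spam only

# Four mildly hammy words plus the one strongly spammy phrase.
combined = 4 * word + phrase

print(f"single word:      {word:+.2f}")
print(f"four-word phrase: {phrase:+.2f}")
print(f"4 words + phrase: {combined:+.2f}")
```

Under these assumptions the phrase at 0.98 pulls more than four times
as hard as a single word at 0.3, so it can outweigh all four hammy
words with no explicit extra weight given to longer phrases.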

Michael.


