[bogofilter] ESF and redundancy

Wed May 12 13:58:30 CEST 2004

On 12 May 2004 07:39:25 -0400
Tom Anderson wrote:

> On Tue, 2004-05-11 at 21:48, michael at optusnet.com.au wrote:
> > Indeed.  I quite like the way CRM114 does things (and indeed
> > I posted a patch some time ago that implemented the 'lossy database'
> > idea).
> 
> Let's discuss this for inclusion in the future versions of bogofilter
> then.
>  
> > This bit of CRM114 I'm not sure about. With word pairs, the number
> > of heavily imbalanced scores (i.e. always ham or always spam) is
> > quite high. The longer the phrase is, the more likely that it's
> > always ham or spam. So that tends to automatically place more weight
> > on longer phrases.
> 
> Given a phrase like "your document is attached", assume that each
> individual word is significantly hammy.  In order for a combined
> phrase to then outweigh the hammy score of the individual tokens, it
> must have a much higher weight than the sum of the individual tokens. 
> Otherwise you're only cancelling out the effect, producing a neutral
> (or even just less hammy) score rather than a spam score.  This is not
> the desired behavior.  If we see a four-word phrase that we know is
> spammy, then we want it to contribute significantly to the final
> score.
> 
> Tom

The Bayesian method on which bogofilter is based assigns a score to
each token and computes a final score (via the inverse chi-square test)
from the number of tokens and their scores.  "your document is attached"
may have a score of 1.000000, but the Bayesian computation doesn't
consider that phrase any more important than any other token.

What counts is the preponderance of evidence.  Conceivably one could
assign a four-word phrase an importance of 4 and use it four times in
the computation, as if it were four separate tokens with the same score.
Undoubtedly there are a zillion other ways to change the importance of
a token (or phrase).
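
To make the "use it 4 times" idea concrete, here is a rough sketch in
Python.  It is not bogofilter's code: it assumes the Fisher-style
inverse chi-square combination as it is commonly described, and the
token scores (0.2 for each hammy word, 0.99 for the spammy phrase) and
the function names are made up for illustration.

```python
import math

def chi2q(x2, df):
    """Probability that a chi-square variable with an even number of
    degrees of freedom `df` exceeds `x2` (closed-form series)."""
    m = x2 / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, df // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def combine(scores):
    """Combine per-token spam scores (each strictly between 0 and 1)
    into one message score; 0 is ham, 1 is spam, 0.5 is neutral."""
    n = len(scores)
    # Evidence for spam: large when many scores are close to 1.
    spam_ev = 1.0 - chi2q(-2.0 * sum(math.log(1.0 - p) for p in scores), 2 * n)
    # Evidence for ham: large when many scores are close to 0.
    ham_ev = 1.0 - chi2q(-2.0 * sum(math.log(p) for p in scores), 2 * n)
    return (1.0 + spam_ev - ham_ev) / 2.0

# Hypothetical scores: four hammy words and one spammy four-word phrase.
words = [0.2, 0.2, 0.2, 0.2]
phrase = 0.99

once = combine(words + [phrase])          # phrase counted like any other token
weighted = combine(words + [phrase] * 4)  # phrase counted four times

print("phrase used once:      %.3f" % once)
print("phrase used four times: %.3f" % weighted)
```

With these made-up numbers, counting the phrase once leaves the message
on the ham side of neutral, while counting it four times pushes it to
the spam side.  Repeating a token is only one crude way to express
importance; it also inflates the token count that the chi-square test
sees.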