[bogofilter] ESF and redundancy
David Relson
relson at osagesoftware.com
Wed May 12 13:58:30 CEST 2004
On 12 May 2004 07:39:25 -0400
Tom Anderson wrote:
> On Tue, 2004-05-11 at 21:48, michael at optusnet.com.au wrote:
> > Indeed. I quite like the way CRM114 does things (and indeed
> > I posted a patch some time ago that implemented the 'lossy database'
> > idea).
>
> Let's discuss this for inclusion in the future versions of bogofilter
> then.
>
> > This bit of CRM114 I'm not sure about. With word pairs, the number
> > of heavily imbalanced scores (i.e. always ham or always spam) is
> > quite high. The longer the phrase is, the more likely that it's
> > always ham or spam. So that tends to automatically place more weight
> > on longer phrases.
>
> Given a phrase like "your document is attached", assume that each
> individual word is significantly hammy. In order for a combined
> phrase to then outweigh the hammy score of the individual tokens, it
> must have a much higher weight than the sum of the individual tokens.
> Otherwise you're only cancelling out the effect, producing a neutral
> (or even just less hammy) score rather than a spam score. This is not
> the desired behavior. If we see a four-word phrase that we know is
> spammy, then we want it to contribute significantly to the final
> score.
>
> Tom
The Bayesian approach on which bogofilter is based assigns a score to
each token and computes a final score (via the inverse chi-square test)
from the number of tokens and their scores. "your document is attached"
may have a score of 1.000000, but the Bayesian computation doesn't
consider that phrase any more important than any other token.

What counts is the preponderance of evidence. Conceivably one could
assign a four-word phrase an importance of 4 and use it four times in
the computation. Undoubtedly there are a zillion other ways to change
the importance of a token (or phrase).
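
To make the "use it 4 times" idea concrete, here is a rough sketch in
Python. It is not bogofilter's actual code: the names chi2q and
combined_score, the clamping constant eps, and the example scores (0.2
for each hammy word, 0.99 for the spammy phrase) are all made up for
illustration. It assumes the Robinson-Fisher style of combining, where
each token score f is a spam probability and the final score is
(1 + P - Q) / 2 from two inverse chi-square tails. An integer weight w
is treated exactly like repeating the token w times.

```python
import math


def chi2q(x, df):
    """Upper-tail chi-square probability for an even number of degrees
    of freedom (closed-form series, no external library needed)."""
    assert df > 0 and df % 2 == 0
    m = x / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, df // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def combined_score(tokens, eps=1e-6):
    """tokens is a list of (score, weight) pairs; weight is an integer
    repeat count. Returns a value in [0, 1], higher meaning spammier.
    Hypothetical helper, not part of bogofilter."""
    n = 0
    ln_f = 0.0       # sum of weight * ln(f)
    ln_not_f = 0.0   # sum of weight * ln(1 - f)
    for score, weight in tokens:
        f = min(max(score, eps), 1.0 - eps)  # avoid log(0)
        ln_f += weight * math.log(f)
        ln_not_f += weight * math.log(1.0 - f)
        n += weight
    if n == 0:
        return 0.5  # no evidence either way
    p = chi2q(-2.0 * ln_f, 2 * n)      # near 1 when scores are spammy
    q = chi2q(-2.0 * ln_not_f, 2 * n)  # near 1 when scores are hammy
    return (1.0 + p - q) / 2.0


# Four hammy words plus one spammy phrase (assumed scores).
words = [(0.2, 1)] * 4
plain = combined_score(words + [(0.99, 1)])     # phrase counted once
weighted = combined_score(words + [(0.99, 4)])  # phrase counted 4 times
print(plain, weighted)
```

With these assumed numbers, counting the phrase once leaves the four
hammy words roughly in control of the result, while counting it four
times moves the final score clearly toward spam. Whether a weight equal
to the phrase length is the right choice is a separate question.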