[bogofilter] ESF and redundancy

Wed May 12 03:48:29 CEST 2004

Tom Anderson <tanderso at oac-design.com> writes:
> On Mon, 2004-05-10 at 19:33, michael at optusnet.com.au wrote:
> > a) A markovian would require a fairly large database. Something on
> > the order of 900,000,000 elements for a naive 2-d English
> > predictor. (approx 30k words in English; squared).
> 
> I'd imagine so.  What size databases does CRM114 create on average?  I'd
> guess it would need to be pruned fairly aggressively.

Indeed.  I quite like the way CRM114 does things (and indeed
I posted a patch some time ago that implemented the 'lossy database'
idea).

> > b) The larger the text, the closer the markovian predictor will
> > match the trivial ESF average. (By the weak law of large numbers).
> 
> Most emails are not very large.  Most of the spams that still get
> through for me are very short virus messages with very similar wording
> ("your file is attached", etc), but apparently they have enough header
> tokens (content-type, domain, date, etc) to offset the spaminess.  I
> think using multiple-token phrases will help here.

Note that the convergence to the mean is quite fast. 100 scoring
tokens is considered 'lots' in this context.
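
To give a feel for how fast that is, here is a rough sketch. It assumes
per-token scores that are independent and uniformly random on [0, 1],
which real token scores are not, and the function name spread_of_mean
is made up for this illustration. It only shows how the spread of an
averaged score shrinks as the number of scoring tokens grows; it is not
bogofilter's scoring code.

```python
import random
import statistics


def spread_of_mean(n_tokens, trials=2000, seed=0):
    """Std. deviation of the mean of n_tokens random per-token scores.

    Assumption: scores are independent and uniform on [0, 1].
    """
    rng = random.Random(seed)
    means = []
    for _ in range(trials):
        scores = [rng.random() for _ in range(n_tokens)]
        means.append(statistics.fmean(scores))
    return statistics.pstdev(means)


# The spread falls roughly as 1/sqrt(n_tokens).
for n in (5, 25, 100, 400):
    print(n, round(spread_of_mean(n), 4))
```

Under these assumptions the spread at 100 tokens is already only a few
percent of the score range, so individual tokens or phrases have little
room left to move the average.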

Having said that, I actually use multiple-token phrases in the production
system because the gain in accuracy is so high.

> > The better way of doing this is via the word pair type approach I think.
> 
> Then you limit accuracy.  The whole point of Markovian is to rank longer
> phrases with superincreasing weight.

This bit of CRM114 I'm not sure about. With word pairs, the number of
heavily imbalanced scores (i.e. always ham or always spam) is quite
high. The longer the phrase is, the more likely that it's always ham
or spam. So that tends to automatically place more weight on 
longer phrases.

It's something I'd definitely like to see more research on.

> I think that the functionality
> should exist to create a window of whatever size you want.  So if you
> want only two-word phrases, then it could be set in the config.  Those
> who want to experiment with longer window sizes could do so as
> well.  Just my suggestion based on the results provided by the article
> previously posted here.  It was my understanding that the 1.0 release
> was to be a single-token filter, and we would try multiple afterwards.

I'm eagerly awaiting that myself. :)

Michael.