[bogofilter] ESF and redundancy

Tue May 11 14:17:40 CEST 2004

On Mon, 2004-05-10 at 19:33, michael at optusnet.com.au wrote:
> a) A markovian would require a fairly large database. Something on
> the order of 900,000,000 elements for a naive 2-d English
> predictor. (approx 30k words in English; squared).

I'd imagine so.  What size databases does CRM114 create on average?  I'd
guess it would need to be pruned fairly aggressively.
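
As a rough back-of-the-envelope check on the quoted estimate, here is a
small sketch.  It assumes a 30,000-word vocabulary and one table entry
per ordered word pair; the bytes-per-entry figure is purely a
hypothetical value for illustration, not anything bogofilter or CRM114
actually uses.

```python
VOCAB_SIZE = 30_000      # approximate English vocabulary, per the quoted estimate
BYTES_PER_ENTRY = 8      # hypothetical: e.g. two 32-bit counters (spam, ham)


def naive_pair_table_entries(vocab_size: int) -> int:
    """Number of ordered word pairs a naive two-word table could hold."""
    return vocab_size ** 2


entries = naive_pair_table_entries(VOCAB_SIZE)
size_gib = entries * BYTES_PER_ENTRY / 2**30
print(f"{entries:,} possible pairs, about {size_gib:.1f} GiB if all were stored")
```

This is only an upper bound: a real database would presumably store
only the pairs that actually occur in the training mail, which is why
pruning matters.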

> b) The larger the text, the closer the markovian predictor will
> match the trivial ESF average. (By the weak law of large numbers).

Most emails are not very large.  Most of the spam that still gets
through for me consists of very short virus messages with very similar
wording ("your file is attached", etc.), but apparently they have enough
header tokens (content-type, domain, date, etc.) to offset the
spamminess.  I think using multiple-token phrases will help here.

> The better way of doing this is via the word pair type approach I think.

Then you limit accuracy.  The whole point of Markovian is to rank longer
phrases with superincreasing weight.  I think that the functionality
should exist to create a window of whatever size you want.  So if you
want only two-word phrases, then it could be set in the config.  Those
who want to experiment with longer window sizes could do so as
well.  Just my suggestion based on the results provided by the article
previously posted here.  It was my understanding that the 1.0 release
was to be a single-token filter, and we would try multiple afterwards.
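
To make the configurable-window idea concrete, here is a minimal sketch.
It is not bogofilter code and not CRM114's actual scheme: the name
max_window is a hypothetical config option, only contiguous phrases are
generated, and the weight 4**(n-1) is an assumption chosen simply
because each weight then exceeds the sum of all smaller ones
(superincreasing).

```python
def phrase_tokens(words, max_window=2):
    """Yield (phrase, weight) for each run of 1..max_window consecutive words.

    max_window=1 reproduces plain single-token behaviour.
    """
    for start in range(len(words)):
        for n in range(1, max_window + 1):
            if start + n > len(words):
                break  # phrase would run past the end of the message
            phrase = " ".join(words[start:start + n])
            weight = 4 ** (n - 1)  # assumed superincreasing weights: 1, 4, 16, ...
            yield phrase, weight


# Two-word window over one of the short virus messages mentioned above.
tokens = list(phrase_tokens("your file is attached".split(), max_window=2))
for phrase, weight in tokens:
    print(weight, phrase)
```

With max_window=2 the four-word message yields the four single words
plus three two-word phrases, and the phrases carry the larger weight.
Raising max_window adds longer phrases without changing the rest.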

Tom