[bogofilter] ESF and redundancy

Tue May 11 18:43:51 CEST 2004

On 11 May 2004 08:17:40 -0400
Tom Anderson wrote:

> On Mon, 2004-05-10 at 19:33, michael at optusnet.com.au wrote:
> > a) A markovian would require a fairly large database. Something on
> > the order of 900,000,000 elements for a naive 2-d English
> > predictor. (approx 30k words in English; squared).
> 
> I'd imagine so.  What size databases does CRM114 create on average? 
> I'd guess it would need to be pruned fairly aggressively.
> 
> > b) The larger the text, the closer the markovian predictor will
> > match the trivial ESF average. (By the weak law of large numbers).
> 
> Most emails are not very large.  Most of the spams that still get
> through for me are very short virus messages with very similar
> wording ("your file is attached", etc.), but apparently they have
> enough header tokens (content-type, domain, date, etc.) to offset the
> spamminess.  I think using multiple-token phrases will help here.
> 
> > The better way of doing this is via the word pair type approach I
> > think.
> 
> Then you limit accuracy.  The whole point of Markovian is to rank
> longer phrases with superincreasing weight.  I think that the
> functionality should exist to create a window of whatever size you
> want.  So if you want only two-word phrases, then it could be set in
> the config.  Those who want to experiment with longer window sizes
> could do so as well.  Just my suggestion based on the results
> provided by the article previously posted here.  It was my
> understanding that the 1.0 release was to be a single-token filter,
> and we would try multiple afterwards.

Tom,

That's the plan -- 1 token now, 2 later.  It can be modified to "n
tokens".  Perhaps now's the time to buy some stock in disk drive
companies :-)
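
To put rough numbers on the disk space question, here is a minimal
sketch in Python.  It is not bogofilter's tokenizer: the names
phrase_tokens and worst_case_keys are made up for illustration, and it
assumes that a window of n keeps every run of 1 to n consecutive words
as a single token.  The 30k vocabulary is the estimate quoted above.

```python
from collections import Counter


def phrase_tokens(words, window=2):
    """Yield every run of 1..window consecutive words as one token.

    Hypothetical helper: assumes single words are kept alongside phrases.
    """
    for size in range(1, window + 1):
        for start in range(len(words) - size + 1):
            yield " ".join(words[start:start + size])


def worst_case_keys(vocabulary, window):
    """Upper bound on distinct keys: vocabulary**1 + ... + vocabulary**window."""
    return sum(vocabulary ** size for size in range(1, window + 1))


# Example message body taken from the thread above.
words = "your file is attached".split()
counts = Counter(phrase_tokens(words, window=2))

# Worst-case key counts for the 30k-word vocabulary estimate.
for n in (1, 2, 3):
    print(n, worst_case_keys(30_000, n))
```

The bound grows by a factor of roughly the vocabulary size for each
extra token in the window.  It is only a worst case, since most word
combinations never show up in real mail, but it suggests why longer
windows would likely need pruning.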

David