[bogofilter] ESF and redundancy
David Relson
relson at osagesoftware.com
Tue May 11 18:43:51 CEST 2004
On 11 May 2004 08:17:40 -0400
Tom Anderson wrote:
> On Mon, 2004-05-10 at 19:33, michael at optusnet.com.au wrote:
> > a) A Markovian would require a fairly large database: something on
> > the order of 900,000,000 elements for a naive 2-d English
> > predictor (approx. 30k words in English, squared).
>
> I'd imagine so. What size databases does CRM114 create on average?
> I'd guess it would need to be pruned fairly aggressively.
>
> > b) The larger the text, the closer the Markovian predictor will
> > match the trivial ESF average (by the weak law of large numbers).
>
> Most emails are not very large. Most of the spams that still get
> through for me are very short virus messages with very similar
> wording ("your file is attached", etc.), but apparently they have enough
> header tokens (content-type, domain, date, etc.) to offset the
> spaminess. I think using multiple-token phrases will help here.
>
> > The better way of doing this is via the word pair type approach I
> > think.
>
> Then you limit accuracy. The whole point of Markovian is to rank
> longer phrases with superincreasing weight. I think that the
> functionality should exist to create a window of whatever size you
> want. So if you want only two-word phrases, then it could be set in
> the config. Those that want to experiment with longer window
> sizes could do so as well. Just my suggestion based on the results
> provided by the article previously posted here. It was my
> understanding that the 1.0 release was to be a single-token filter,
> and we would try multiple afterwards.
Tom,
That's the plan -- 1 token now, 2 later. It can be modified to "n
tokens". Perhaps now's the time to buy some stock in disk drive
companies :-)
David
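
The "n tokens" idea and its disk cost can be sketched roughly as follows. This is only an illustration under stated assumptions, not bogofilter's or CRM114's actual code: the function names token_windows and worst_case_entries and the max_window parameter are hypothetical, phrases are assumed to be contiguous runs of tokens, and the 30k vocabulary is the rough figure quoted in the thread.

```python
def token_windows(tokens, max_window=2):
    """Yield every run of 1..max_window consecutive tokens as one phrase.

    max_window stands in for the configurable window size suggested in
    the thread; it is a hypothetical parameter, not a real option.
    """
    for size in range(1, max_window + 1):
        for start in range(len(tokens) - size + 1):
            yield " ".join(tokens[start:start + size])


def worst_case_entries(vocabulary_size, max_window):
    """Upper bound on distinct phrases: V + V**2 + ... + V**max_window."""
    return sum(vocabulary_size ** n for n in range(1, max_window + 1))


# Single tokens plus two-token phrases from a short message body.
phrases = list(token_windows("your file is attached".split(), max_window=2))

# Worst-case database entries for a 30k-word vocabulary (assumed figure).
single_token_bound = worst_case_entries(30_000, 1)
two_token_bound = worst_case_entries(30_000, 2)
```

The bound is a worst case: a real wordlist holds only the phrases actually seen in training mail, so it would be far smaller, but the growth with window size is why pruning and disk space come up above.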