[bogofilter] Improved Calculations
David Relson
relson at osagesoftware.com
Thu May 13 13:29:37 CEST 2004
On 13 May 2004 12:33:57 +1000
michael at optusnet.com.au wrote:
> "Boris 'pi' Piwinger" <3.14 at logic.univie.ac.at> writes:
> > David Relson wrote:
> >
> > > Gary Robinson's blog has a new "Improved Chi" article (at
> > > http://www.garyrobinson.net/2004/04/improved_chi.html) which
> > > points to his new paper, "Handling Redundancy in Email Token
> > > Probabilities" (at
> > > http://garyrob.blogs.com//handlingtokenredundancy93.pdf).
> >
> > I read those. I have to say, that -- without knowing the
> > theory -- this is something I really don't understand. The
> > new parameters may have some intuition, but how they are
> > used is magic. That makes it hard to guess good values; as
> > the article explains, only testing seems to work.
>
> IMHO, at least the ham parameter should be computable from
> the informational entropy of English text.
>
> And presumably the spam parameter would be computable
> by a similar means on a sufficient quantity of spam text.
>
> But the actual mapping from the entropy values to the ESF value
> escapes me.
>
> Michael.
Hi Michael,
I've not yet found the actual mapping from wordlist to robs, robx,
min_dev, etc. We've got a scanning tool in bogotune that will
empirically find an answer. Wouldn't it be nice to have a mathematical
formula?
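
The entropy Michael mentions can at least be measured empirically, even though the mapping from that number to an ESF value is still unknown. Here is a minimal sketch, assuming plain whitespace tokenization (not bogofilter's lexer) and a hypothetical helper name, token_entropy:

```python
import math
from collections import Counter


def token_entropy(text):
    """Empirical Shannon entropy, in bits per token, of a text.

    Assumption: tokens are lowercased, whitespace-separated words.
    This is only an illustration, not how bogofilter tokenizes mail.
    """
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # H = -sum(p * log2(p)) over the observed token frequencies
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

Running this over a ham corpus and a spam corpus would give two numbers to compare, but nothing here says how to turn either one into a parameter value.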
Finding the ESF values is no harder and no easier than finding the
other parameters. Now that bogotune is scanning for 5 parameters
(rather than 3), the scan time has increased dramatically.
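
To put a rough number on that growth, here is a sketch of an exhaustive grid scan. It is not bogotune's actual search strategy; the function names, the step count, and the score function are all assumptions for illustration:

```python
from itertools import product


def grid_points(num_params, steps_per_param):
    """Number of combinations an exhaustive scan has to evaluate."""
    return steps_per_param ** num_params


def exhaustive_scan(grid, score):
    """Try every combination in `grid` and return (best_score, best_params).

    `grid` maps a parameter name to its candidate values.
    `score` is a hypothetical callable returning an error count
    (lower is better); in practice it would mean classifying a
    whole test corpus once per combination.
    """
    names = list(grid)
    best = None
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        result = score(params)
        if best is None or result < best[0]:
            best = (result, params)
    return best
```

With an assumed 10 candidate values per parameter, grid_points(3, 10) is 1000 while grid_points(5, 10) is 100000, so two extra parameters cost a factor of 100 in corpus runs.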
Wish we had that formula!
Regards,
David