[bogofilter] Improved Calculations

David Relson relson at osagesoftware.com
Thu May 13 13:29:37 CEST 2004


On 13 May 2004 12:33:57 +1000
michael at optusnet.com.au wrote:

> "Boris 'pi' Piwinger" <3.14 at logic.univie.ac.at> writes:
> > David Relson wrote:
> > 
> > > Gary Robinson's blog has a new "Improved Chi" article (at
> > > http://www.garyrobinson.net/2004/04/improved_chi.html) which
> > > points to his new paper, "Handling Redundancy in Email Token
> > > Probabilities" (at
> > > http://garyrob.blogs.com//handlingtokenredundancy93.pdf).  
> > 
> > I read those. I have to say, that -- without knowing the
> > theory -- this is something I really don't understand. The
> > new parameters may have some intuition, but how they are
> > used is magic. That makes it hard to guess good values, as
> > the article explains, only testing seems to work.
> 
> IMHO, at least the ham parameter should be computable from
> the informational entropy of English text.
> 
> And presumably the spam parameter would be computable
> by a similar means on a sufficient quantity of spam text.
> 
> But the actual mapping from the entropy values to the ESF value
> escapes me.
> 
> Michael.

Hi Michael,

I've not yet found the actual mapping from wordlist to robs, robx,
min_dev, etc.  We've got a scanning tool in bogotune that will
empirically find an answer.  Wouldn't it be nice to have a mathematical
formula?

Finding the ESF values is no harder and no easier than finding robs,
robx, and min_dev.  Now that bogotune is scanning for 5 parameters
(rather than 3), the scan time has increased dramatically.
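
As a rough illustration of why two extra parameters hurt so much, here
is a minimal sketch.  It assumes a plain exhaustive grid in which every
parameter is tried at a fixed number of candidate values; bogotune's
actual scan strategy and grid sizes may well differ.  The function name
grid_size and the figure of 10 values per parameter are hypothetical,
chosen only to show the growth.

```python
def grid_size(values_per_param):
    """Return the number of combinations in an exhaustive grid scan.

    values_per_param: list with the number of candidate values tried
    for each parameter (hypothetical figures, not bogotune's real ones).
    """
    total = 1
    for n in values_per_param:
        total *= n
    return total


# Assumption: 10 candidate values for every parameter.
three_params = grid_size([10, 10, 10])          # e.g. robs, robx, min_dev
five_params = grid_size([10, 10, 10, 10, 10])   # plus, presumably, the two ESF values

print(three_params, five_params, five_params // three_params)
# -> 1000 100000 100
```

Under those assumptions each added parameter multiplies the work by the
number of values tried for it, so going from 3 to 5 parameters costs a
factor of 100 here, not a factor of 5/3.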

Wish we had that formula!

Regards,

David


