[bogofilter] Improved Calculations

michael at optusnet.com.au
Fri May 14 01:35:17 CEST 2004


"Boris 'pi' Piwinger" <3.14 at piology.org> writes:
> michael at optusnet.com.au wrote:
> >> > Gary Robinson's blog has a new "Improved Chi" article (at
> >> > http://www.garyrobinson.net/2004/04/improved_chi.html) which points to
> >> > his new paper, "Handling Redundancy in Email Token Probabilities" (at
> >> > http://garyrob.blogs.com//handlingtokenredundancy93.pdf).  
> >> 
> >> I read those. I have to say that, without knowing the
> >> theory, this is something I really don't understand. The
> >> new parameters may have some intuition behind them, but how
> >> they are used is magic. That makes it hard to guess good
> >> values; as the article explains, only testing seems to work.
> >
> >IMHO, at least the ham parameter should be computable from
> >the information entropy of English text.
> 
> Why English? I receive messages in German and English,
> occasionally in other languages. Certainly every language
> would have such a parameter.

Random choice. :) I very vaguely recall that the entropy of
the various human languages is fairly similar in the
limit.
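
For anyone who wants to put a rough number on "entropy of text",
here is a minimal sketch in Python. It only counts single-character
frequencies, so it is a crude estimate that ignores context and will
overstate the per-character entropy "in the limit". The function name
unigram_entropy is made up for illustration, and how such a number
would map onto the ham parameter is not worked out in this thread.

```python
import math
from collections import Counter


def unigram_entropy(text: str) -> float:
    """Shannon entropy in bits per character, from single-character counts.

    This is a zeroth-order estimate: it treats characters as independent,
    so it does not capture the redundancy that longer contexts reveal.
    """
    counts = Counter(text)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # H = -sum(p * log2(p)) over the observed characters
    return -sum((n / total) * math.log2(n / total) for n in counts.values())
```

Running it over a sample of ham in each language (say, English and
German mail) would be one cheap way to check whether the figures
really are similar.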

The idea in the ESF work is that spam is not a random
subset of human languages, but a selected subset.


