[bogofilter] Improved Calculations

Thu May 13 13:46:26 CEST 2004

David Relson wrote:

>> > I read those. I have to say that, without knowing the
>> > theory, this is something I really don't understand. The
>> > new parameters may have some intuition behind them, but
>> > how they are used is magic. That makes it hard to guess
>> > good values; as the article explains, only testing seems
>> > to work.
>> 
>> IMHO, at least the ham parameter should be computable from
>> the informational entropy of English text.
>> 
>> And presumably the spam parameter would be computable by
>> similar means from a sufficient quantity of spam text.
>> 
>> But the actual mapping from the entropy values to the ESF value
>> escapes me.
> 
> I've not yet found the actual mapping from wordlist to robs, robx,
> min_dev, etc. 

OK, but for those there is a clear intuition (which of course
is a dangerous thing ;-), so we at least understand the
consequences of modifying a value.
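
To make that intuition concrete, here is a minimal sketch. It
assumes Robinson's smoothing f(w) = (s*x + n*p(w)) / (s + n)
with s = robs and x = robx, and it assumes min_dev simply drops
tokens whose f(w) lies within min_dev of 0.5. The function
names and the numbers are made up for illustration, and p(w) is
simplified to the raw spam fraction (equal corpus sizes
assumed); this is not bogofilter's actual code.

```python
def smoothed_prob(spam_count, ham_count, robs, robx):
    """Robinson-style f(w): blend the prior robx with the observed p(w).

    Simplifying assumption: spam and ham corpora are the same size,
    so p(w) is just the fraction of sightings that were in spam.
    """
    n = spam_count + ham_count
    if n == 0:
        # Never-seen token: all we have is the prior.
        return robx
    p = spam_count / n
    # robs acts like the number of "virtual" sightings backing robx.
    return (robs * robx + n * p) / (robs + n)


def is_scored(fw, min_dev):
    """Assumed min_dev rule: ignore tokens too close to neutral (0.5)."""
    return abs(fw - 0.5) >= min_dev


if __name__ == "__main__":
    # One sighting, in spam. Illustrative values, not recommendations.
    for robs in (0.1, 1.0):
        fw = smoothed_prob(1, 0, robs=robs, robx=0.5)
        print(f"robs={robs}: f(w)={fw:.3f} scored={is_scored(fw, 0.3)}")
```

A larger robs pulls a rarely seen token back toward robx, and
a larger min_dev throws away more of the near-neutral tokens.
That is the kind of handle I am missing for the new parameters.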

> We've got a scanning tool in bogotune that will
> empirically find an answer. 

Not for train-on-error.

> Wouldn't it be nice to have a mathematical formula?

It would.

> Finding the ESF values is no harder and no easier. 

In train-on-error the parameter values used during training
are built into the database (in a subtle way, but they are in
there). So I can easily shift robx and the cutoffs to values I
like. robs and min_dev are harder, but that seems to be OK.
For the new parameters I don't even have an estimate.

pi