histogram of wordlist.db

David Relson relson at osagesoftware.com
Sat Jan 3 23:32:28 CET 2004


On Sat, 03 Jan 2004 22:50:26 +0100
Stefan Bellon <sbellon at sbellon.de> wrote:


> But your 0.00 to 0.95 ratio is more even than mine. I think this means
> I get more spam than ham. Does this mean I should switch from training
> each message to train on error? Can't Bogofilter itself account for
> that?

The histogram doesn't include any info about ham/spam message counts or
ratios.  It probably should at least report the counts.

In actual use, the token counts are normalized using the message counts.
For example, if you have trained with twice as many ham messages as spam
messages and a token's ham count is twice its spam count, the token's
normalized ham and spam frequencies are the same.
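
A rough sketch of that normalization, in Python.  The function name and
the counts are made up for illustration, and this is not Bogofilter's
actual code; it only shows a token count being divided by the number of
messages of its class before the two are compared:

```python
def spam_probability(spam_count, ham_count, spam_msgs, ham_msgs):
    """Per-token spam probability from message-normalized frequencies.

    Hypothetical helper for illustration. Assumes both message counts are
    non-zero and the token has been seen at least once.
    """
    spam_freq = spam_count / spam_msgs   # occurrences per spam message
    ham_freq = ham_count / ham_msgs      # occurrences per ham message
    return spam_freq / (spam_freq + ham_freq)


# Trained on twice as many ham messages (2000) as spam messages (1000).
# The token's ham count (100) is twice its spam count (50), so after
# normalization the token leans neither way.
neutral = spam_probability(spam_count=50, ham_count=100,
                           spam_msgs=1000, ham_msgs=2000)

# Same raw counts for ham and spam, but half as many spam messages:
# the token is relatively more frequent in spam.
spammy = spam_probability(spam_count=100, ham_count=100,
                          spam_msgs=1000, ham_msgs=2000)

print(neutral, spammy)
```

So an uneven ham/spam training ratio doesn't by itself skew the token
scores.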

> BTW: I'm still using the default values (robx==0.415 and min_dev==0.1)
> with the above results.

The purpose of the histogram is to give an idea of the distribution of
scores.  The code generating it is simplistic and always uses the
default robx and robs values.  min_dev has no effect on an individual
token's score; it determines which tokens are included in a message's
final score.
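
To illustrate the min_dev point, here is a minimal Python sketch.  The
function name, the sample tokens and their scores are invented, and
treating a deviation of exactly min_dev as "kept" is an assumption; this
is not Bogofilter's code.  The idea is that tokens scoring too close to
the neutral 0.5 are dropped before the message score is computed:

```python
def tokens_used_for_scoring(token_scores, min_dev=0.1):
    """Return the tokens whose score deviates from 0.5 by at least min_dev.

    Hypothetical helper: each token's own score is left unchanged; min_dev
    only decides whether the token takes part in the message score.
    """
    return {tok: score for tok, score in token_scores.items()
            if abs(score - 0.5) >= min_dev}


# Invented per-token scores for one message.
scores = {"viagra": 0.98, "the": 0.52, "meeting": 0.07, "offer": 0.58}

kept = tokens_used_for_scoring(scores, min_dev=0.1)
print(sorted(kept))
```

With min_dev set to 0.1, "the" and "offer" are ignored because they lie
within 0.1 of 0.5, while "viagra" and "meeting" are kept.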



