histogram of wordlist.db

Stefan Bellon sbellon at sbellon.de
Sat Jan 3 22:50:26 CET 2004


David Relson wrote:

> score   count  pct  histogram
> 0.00   545757 47.13 ################################################
> 0.05     3099  0.27 #
> 0.10     3054  0.26 #
> 0.15     3128  0.27 #
> 0.20     4015  0.35 #
> 0.25     2112  0.18 #
> 0.30     4395  0.38 #
> 0.35     5326  0.46 #
> 0.40     2122  0.18 #
> 0.45     3178  0.27 #
> 0.50     2093  0.18 #
> 0.55    10681  0.92 #
> 0.60     2509  0.22 #
> 0.65     3163  0.27 #
> 0.70     6122  0.53 #
> 0.75     4926  0.43 #
> 0.80     3891  0.34 #
> 0.85     4324  0.37 #
> 0.90     5004  0.43 #
> 0.95   539119 46.56 ################################################
> tot   1158018
> hapaxes:  ham  359544 (31.05%), spam  376536 (32.52%)
>    pure:  ham  542992 (46.89%), spam  535376 (46.23%)

Wow, mine indeed looks very similar:

score   count  pct  histogram
0.00   123531 36.73 ###############################
0.05     1053  0.31 #
0.10      974  0.29 #
0.15     1094  0.33 #
0.20      923  0.27 #
0.25      989  0.29 #
0.30     1163  0.35 #
0.35      629  0.19 #
0.40     1569  0.47 #
0.45      772  0.23 #
0.50      608  0.18 #
0.55     2978  0.89 #
0.60      427  0.13 #
0.65      817  0.24 #
0.70     1621  0.48 #
0.75      553  0.16 #
0.80     1066  0.32 #
0.85     1295  0.39 #
0.90     1104  0.33 #
0.95   193178 57.43 ################################################
tot    336344
hapaxes:  ham   72803 (21.65%), spam  138101 (41.06%)
   pure:  ham  122645 (36.46%), spam  192225 (57.15%)

But the ratio of your 0.00 bucket to your 0.95 bucket is more even than
mine. I think this means I get more spam than ham. Does this mean I
should switch from training on each message to training on error? Can't
Bogofilter itself account for that?
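
To put a number on the difference, here is a quick sketch in Python. It
is not part of bogofilter; low_high_ratio is just a hypothetical helper
name, and the counts are copied from the 0.00 and 0.95 rows of the two
histograms above.

```python
# Counts copied from the 0.00 and 0.95 rows of the two histograms above.
# low_high_ratio is a hypothetical helper, not a bogofilter function.

def low_high_ratio(low_count, high_count):
    """Return tokens in the lowest bucket per token in the highest bucket."""
    return low_count / high_count

david = low_high_ratio(545757, 539119)
stefan = low_high_ratio(123531, 193178)

print(f"David:  {david:.2f}")
print(f"Stefan: {stefan:.2f}")
```

This prints 1.01 for David's wordlist and 0.64 for mine, so my list has
noticeably fewer tokens in the lowest bucket than in the highest one.
Note that this only compares token counts; it says nothing by itself
about how many messages of each kind were trained.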

BTW: I'm still using the default values (robx==0.415 and min_dev==0.1)
with the above results.

-- 
Stefan Bellon



