histogram of wordlist.db
Stefan Bellon
sbellon at sbellon.de
Sat Jan 3 22:50:26 CET 2004
David Relson wrote:
> score count pct histogram
> 0.00 545757 47.13 ################################################
> 0.05 3099 0.27 #
> 0.10 3054 0.26 #
> 0.15 3128 0.27 #
> 0.20 4015 0.35 #
> 0.25 2112 0.18 #
> 0.30 4395 0.38 #
> 0.35 5326 0.46 #
> 0.40 2122 0.18 #
> 0.45 3178 0.27 #
> 0.50 2093 0.18 #
> 0.55 10681 0.92 #
> 0.60 2509 0.22 #
> 0.65 3163 0.27 #
> 0.70 6122 0.53 #
> 0.75 4926 0.43 #
> 0.80 3891 0.34 #
> 0.85 4324 0.37 #
> 0.90 5004 0.43 #
> 0.95 539119 46.56 ################################################
> tot 1158018
> hapaxes: ham 359544 (31.05%), spam 376536 (32.52%)
> pure: ham 542992 (46.89%), spam 535376 (46.23%)
Wow, mine does indeed look very similar:
score count pct histogram
0.00 123531 36.73 ###############################
0.05 1053 0.31 #
0.10 974 0.29 #
0.15 1094 0.33 #
0.20 923 0.27 #
0.25 989 0.29 #
0.30 1163 0.35 #
0.35 629 0.19 #
0.40 1569 0.47 #
0.45 772 0.23 #
0.50 608 0.18 #
0.55 2978 0.89 #
0.60 427 0.13 #
0.65 817 0.24 #
0.70 1621 0.48 #
0.75 553 0.16 #
0.80 1066 0.32 #
0.85 1295 0.39 #
0.90 1104 0.33 #
0.95 193178 57.43 ################################################
tot 336344
hapaxes: ham 72803 (21.65%), spam 138101 (41.06%)
pure: ham 122645 (36.46%), spam 192225 (57.15%)
But the ratio of your 0.00 bucket to your 0.95 bucket is more even than
mine. I think this means I get more spam than ham. Does this mean I
should switch from training on every message to training on error?
Can't Bogofilter account for that itself?
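
To put a number on "more even", here is a small sketch that divides the 0.00 bucket count by the 0.95 bucket count for both histograms. The counts are copied from the tables above; the function name and the dictionary layout are just my own choices for illustration, not anything Bogofilter provides.

```python
# Counts copied from the two histograms quoted in this thread.
# "ham_end" is the 0.00 bucket, "spam_end" is the 0.95 bucket.
histograms = {
    "David": {"ham_end": 545757, "spam_end": 539119},
    "Stefan": {"ham_end": 123531, "spam_end": 193178},
}


def end_bucket_ratio(ham_end, spam_end):
    """Return the 0.00 bucket count divided by the 0.95 bucket count.

    A value near 1.0 means the two end buckets are about equally full;
    a value below 1.0 means more tokens sit at the spam end.
    """
    return ham_end / spam_end


for name, counts in histograms.items():
    ratio = end_bucket_ratio(counts["ham_end"], counts["spam_end"])
    print(f"{name}: 0.00/0.95 ratio = {ratio:.2f}")
```

This gives roughly 1.01 for David's wordlist and roughly 0.64 for mine, so my list is noticeably tilted towards the spam end.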
BTW: I'm still using the default values (robx==0.415 and min_dev==0.1)
with the above results.
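
For reference, a rough sketch of how these two parameters enter the scoring, as far as I understand Robinson's method: each token gets f(w) = (robs * robx + n * p(w)) / (robs + n), and tokens whose f(w) lies within min_dev of 0.5 are left out. The function names are hypothetical, the robs value of 1.0 is only a placeholder (not necessarily Bogofilter's default), and whether the min_dev boundary is inclusive is an assumption on my part.

```python
def robinson_fw(pw, n, robx=0.415, robs=1.0):
    """Robinson's smoothed token score.

    pw   -- raw spam probability of the token (0.0 .. 1.0)
    n    -- number of times the token has been seen in training
    robx -- score assumed for a token never seen before
    robs -- weight given to robx (placeholder value, an assumption)
    """
    return (robs * robx + n * pw) / (robs + n)


def is_used(fw, min_dev=0.1):
    """Return True if the token is far enough from 0.5 to be scored.

    Treating the boundary as inclusive is an assumption.
    """
    return abs(fw - 0.5) >= min_dev


# A token never seen before gets exactly robx, which lies within
# min_dev of 0.5 and is therefore ignored with these settings.
unseen = robinson_fw(pw=1.0, n=0)
print(unseen, is_used(unseen))

# A token seen once, only in spam, is pulled part of the way towards 1.0.
seen_once = robinson_fw(pw=1.0, n=1)
print(round(seen_once, 4), is_used(seen_once))
```

With n = 0 the result is simply robx, so under these settings unknown tokens do not contribute to the score at all.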
--
Stefan Bellon