Wordlist Histogram [was: What did I do wrong? ]
David Relson
relson at osagesoftware.com
Thu Feb 19 13:25:20 CET 2004
On Thu, 19 Feb 2004 00:18:39 -0500
Eric Wood wrote:
> On Thu, 19 Feb 2004 00:04:58 -0500
> Eric Wood <eric at interplas.com> wrote:
> > For giggles, I also ran the singleton purge command above and it
> > didn't make the new wordlist any smaller, as I would have
> > expected. Is that unusual?
>
> Never mind, I ran "-c1 -a30" on the original wordlist earlier... ignore
> my previous question. Thanks!
> -Eric Wood
Eric,
About the first of the year, bogoutil gained the "-H" option that
generates a histogram of the wordlist. The histogram shows token counts
vs. spam score and also gives counts and percentages for hapaxes and
tokens that are pure ham or spam. The info might help in making
maintenance decisions.
David
P.S. This is what my wordlist currently looks like:
Histogram
score    count    pct  histogram
0.00    565911  44.79  ############################################
0.05      3188   0.25  #
0.10      3116   0.25  #
0.15      3151   0.25  #
0.20      4152   0.33  #
0.25      4077   0.32  #
0.30      2426   0.19  #
0.35      5393   0.43  #
0.40      2290   0.18  #
0.45      3300   0.26  #
0.50      2148   0.17  #
0.55     11409   0.90  #
0.60      2627   0.21  #
0.65      3426   0.27  #
0.70      6455   0.51  #
0.75      4816   0.38  #
0.80      4593   0.36  #
0.85      5463   0.43  #
0.90      5547   0.44  #
0.95    620108  49.07  ################################################
tot    1263596
hapaxes: ham 375505 (29.72%), spam 443797 (35.12%)
pure: ham 562881 (44.55%), spam 616022 (48.75%)
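
If you want to work with these numbers in a script, here is a minimal Python sketch that reads the histogram rows and recomputes each bucket's share of the total. The line layout (score, count, pct, then a "tot" line) is assumed from the sample above and is not a documented format; the names parse_histogram and share are made up for this illustration. SAMPLE holds only the first and last buckets, so the row counts will not add up to the total here.

```python
import re

# Excerpt of the output posted above: first and last buckets plus the total.
SAMPLE = """\
score count pct histogram
0.00 565911 44.79 ############################################
0.95 620108 49.07 ################################################
tot 1263596
"""

# A data row starts with a score such as 0.05, followed by a count and a percentage.
ROW = re.compile(r"^\s*(\d\.\d{2})\s+(\d+)\s+(\d+\.\d+)")


def parse_histogram(text):
    """Return ([(score, count, pct), ...], total) from histogram text."""
    rows = []
    total = None
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            rows.append((float(match.group(1)),
                         int(match.group(2)),
                         float(match.group(3))))
        elif line.split()[:1] == ["tot"]:
            total = int(line.split()[1])
    return rows, total


def share(count, total):
    """Percentage of all tokens that fall in one bucket, to two decimals."""
    return round(100.0 * count / total, 2)


if __name__ == "__main__":
    rows, total = parse_histogram(SAMPLE)
    for score, count, pct in rows:
        print(f"{score:.2f}  listed {pct:5.2f}%  recomputed {share(count, total):5.2f}%")
```

For the two buckets in the excerpt, the recomputed shares agree with the pct column, so roughly 94% of the tokens sit in the lowest and highest score buckets.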