Wordlist Histogram [was: What did I do wrong? ]

David Relson relson at osagesoftware.com
Thu Feb 19 13:25:20 CET 2004


On Thu, 19 Feb 2004 00:18:39 -0500
Eric Wood wrote:

> On Thu, 19 Feb 2004 00:04:58 -0500
> Eric Wood <eric at interplas.com> wrote:
> > For giggles, I also ran the singleton purge command above and it
> > didn't change the new wordlist to make it even smaller, as I would
> > have expected.  Is that unusual?
> 
> Never mind, I ran "-c1 -a30" on the original wordlist earlier... ignore
> my previous question.  Thanks!
> -Eric Wood

Eric,

Around the first of the year, bogoutil gained the "-H" option, which
prints a histogram of the wordlist.  The histogram shows token counts
by spam score, in buckets of 0.05, and also gives counts and
percentages for hapaxes (tokens seen only once) and for tokens that are
pure ham or pure spam (seen only in ham or only in spam).  That
information might help in making maintenance decisions.

David

P.S.  This is what my wordlist currently looks like:

Histogram
score   count  pct  histogram
0.00   565911 44.79 ############################################
0.05     3188  0.25 #
0.10     3116  0.25 #
0.15     3151  0.25 #
0.20     4152  0.33 #
0.25     4077  0.32 #
0.30     2426  0.19 #
0.35     5393  0.43 #
0.40     2290  0.18 #
0.45     3300  0.26 #
0.50     2148  0.17 #
0.55    11409  0.90 #
0.60     2627  0.21 #
0.65     3426  0.27 #
0.70     6455  0.51 #
0.75     4816  0.38 #
0.80     4593  0.36 #
0.85     5463  0.43 #
0.90     5547  0.44 #
0.95   620108 49.07 ################################################
tot   1263596
hapaxes:  ham  375505 (29.72%), spam  443797 (35.12%)
   pure:  ham  562881 (44.55%), spam  616022 (48.75%)
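
If you want to post-process this output rather than read it by eye, here
is a minimal Python sketch.  It assumes the column layout shown above
(score, count, pct, bar, then a "tot" line); the names parse_histogram
and check_percentages are hypothetical helpers for illustration, not
part of bogofilter or bogoutil.

```python
import re

# Matches data rows such as "0.05     3188  0.25 #".
# The layout is assumed from the sample output above, not from bogoutil's source.
ROW = re.compile(r"^\s*(\d\.\d{2})\s+(\d+)\s+(\d+\.\d+)")


def parse_histogram(text):
    """Return (rows, total); rows is a list of (score, count, pct) tuples."""
    rows, total = [], None
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            rows.append((float(m.group(1)), int(m.group(2)), float(m.group(3))))
        elif line.strip().startswith("tot"):
            # The "tot" line carries the total token count as its second field.
            total = int(line.split()[1])
    return rows, total


def check_percentages(rows, total, tol=0.01):
    """Return the rows whose printed pct differs from count/total by more than tol."""
    return [(s, c, p) for s, c, p in rows if abs(100.0 * c / total - p) > tol]
```

Feeding it the text above (saved to a file, say) gives one tuple per
bucket plus the total, which makes it easy to see how much of the
wordlist sits in the two end buckets (0.00 and 0.95) versus the middle.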



