histogram of wordlist.db

David Relson relson at osagesoftware.com
Sat Jan 3 15:26:00 CET 2004


Greetings,

Have you ever wondered what it would look like if you had a histogram of
the spamicity scores of all the tokens in your wordlist?  Mine looks
like:

score   count  pct  histogram
0.00   545757 47.13 ################################################
0.05     3099  0.27 #
0.10     3054  0.26 #
0.15     3128  0.27 #
0.20     4015  0.35 #
0.25     2112  0.18 #
0.30     4395  0.38 #
0.35     5326  0.46 #
0.40     2122  0.18 #
0.45     3178  0.27 #
0.50     2093  0.18 #
0.55    10681  0.92 #
0.60     2509  0.22 #
0.65     3163  0.27 #
0.70     6122  0.53 #
0.75     4926  0.43 #
0.80     3891  0.34 #
0.85     4324  0.37 #
0.90     5004  0.43 #
0.95   539119 46.56 ################################################
tot   1158018
hapaxes:  ham  359544 (31.05%), spam  376536 (32.52%)
   pure:  ham  542992 (46.89%), spam  535376 (46.23%)

The last two lines count hapaxes, i.e. tokens seen exactly once (counts
of 1/0 or 0/1), and pure tokens, i.e. tokens trained solely from ham or
solely from spam messages (counts of h/0 or 0/s).
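
For anyone who wants to check these two lines against their own data,
here is a minimal Python sketch of the tally.  It is not the code from
the patch; it assumes the wordlist has already been dumped into
(ham_count, spam_count) pairs, and the function name and dictionary keys
are made up for illustration.

```python
def tally_hapaxes_and_pure(tokens):
    """Count hapaxes and pure tokens.

    tokens: iterable of (ham_count, spam_count) pairs, one per token.
    This input format is an assumption, not bogoutil's on-disk layout.
    """
    total = 0
    stats = {"hapax_ham": 0, "hapax_spam": 0, "pure_ham": 0, "pure_spam": 0}
    for ham, spam in tokens:
        total += 1
        if ham > 0 and spam == 0:
            # h/0: seen only in ham
            stats["pure_ham"] += 1
            if ham == 1:
                stats["hapax_ham"] += 1   # 1/0
        elif spam > 0 and ham == 0:
            # 0/s: seen only in spam
            stats["pure_spam"] += 1
            if spam == 1:
                stats["hapax_spam"] += 1  # 0/1
    return total, stats


def percent(count, total):
    """Share of all tokens, as printed in the summary lines."""
    return 100.0 * count / total if total else 0.0
```

Every hapax is also a pure token, which is why the hapax counts above
are smaller than the pure counts.  As a check on the arithmetic,
percent(359544, 1158018) rounds to 31.05, matching the ham hapax line.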

The attached patch, applied to 0.16.0, will enable the feature.  To use
it, run "bogoutil -H /your/bogofilter/dir"
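
If you cannot apply the patch, the histogram itself is easy to
approximate.  The Python sketch below is a guess at the layout, not the
patch's implementation: it assumes the spamicity scores are already
computed and lie in 0.0 to 1.0, uses 20 buckets of width 0.05 with a
score of exactly 1.0 folded into the last bucket, and scales the bars to
the largest bucket (rounding up, 48 characters wide), which is merely
consistent with the output shown above.

```python
import math


def score_histogram(scores, buckets=20, width=48):
    """Bucket scores in [0.0, 1.0] and render one text line per bucket.

    Bucket count, bar width, and rounding are assumptions chosen to
    resemble the sample output, not values taken from the patch.
    """
    counts = [0] * buckets
    for score in scores:
        # A score of exactly 1.0 would index past the end; clamp it.
        index = min(int(score * buckets), buckets - 1)
        counts[index] += 1

    total = sum(counts)
    peak = max(counts) if total else 0
    lines = []
    for i, count in enumerate(counts):
        pct = 100.0 * count / total if total else 0.0
        # Round up so that any non-empty bucket gets at least one '#'.
        bar = "#" * math.ceil(width * count / peak) if count else ""
        lines.append(f"{i / buckets:4.2f} {count:8d} {pct:5.2f} {bar}")
    return counts, lines


if __name__ == "__main__":
    _, rendered = score_histogram([0.02, 0.01, 0.97, 0.52, 1.0])
    print("score   count  pct  histogram")
    print("\n".join(rendered))
```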

Enjoy!

David

-- 
David Relson                   Osage Software Systems, Inc.
relson at osagesoftware.com       Ann Arbor, MI 48103
www.osagesoftware.com          tel:  734.821.8800
-------------- next part --------------
A non-text attachment was scrubbed...
Name: patch.bogohist.0103
Type: application/octet-stream
Size: 7607 bytes
Desc: not available
URL: <https://www.bogofilter.org/pipermail/bogofilter-dev/attachments/20040103/8061b0da/attachment.obj>

