Wordlist Histogram [was: What did I do wrong? ]

David Relson relson at osagesoftware.com
Fri Feb 20 01:19:01 CET 2004


On Thu, 19 Feb 2004 19:07:48 -0500
Tom Allison wrote:

> David Relson wrote:
> > On Thu, 19 Feb 2004 14:51:21 +0100
> > Boris 'pi' Piwinger wrote:
> > 
> > 
> >>David Relson wrote:
> >>
> >>[bogoutil -H]
> >>
> >>>hapaxes:  ham  375505 (29.72%), spam  443797 (35.12%)
> >>>   pure:  ham  562881 (44.55%), spam  616022 (48.75%)
> >>
> >>What is the meaning of pure? Tokens which have been seen
> >>only once for one category, but possibly many times in the
> >>other?
> > 
> > 
> > hapaxes have a total ham+spam count of 1.  "pure" indicates either
> > ham or spam is 0.  Given this, all hapaxes are "pure".  I'm open to
> > suggestions for better labels :-)
> > 
> 
> hetero and homo prefixes to something to indicate a mixed (spam and
> ham presence) and singular or pure presence.
> 
> I'm curious as to what these values indicate.
> How would I interpret this correctly?

ham hapaxes - tokens that occurred exactly once, and that one
occurrence was in ham.
spam hapaxes - tokens that occurred exactly once, and that one
occurrence was in spam.

pure ham - tokens that occurred one or more times, but only in ham
(so this count includes the ham hapaxes).
pure spam - tokens that occurred one or more times, but only in spam
(so this count includes the spam hapaxes).

Together they indicate that my wordlist has many tokens that have only
appeared in a single email and many other tokens that have only been in
ham (or in spam) messages.
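
For anyone who wants to see the counting spelled out, here is a rough
sketch in Python. It is not bogoutil's actual code. It assumes the
wordlist is a plain dict mapping each token to a (ham_count, spam_count)
pair, and that the percentages are taken relative to the total number of
tokens in the wordlist; the function name and the sample data are made up
for illustration.

```python
def wordlist_histogram(wordlist):
    """Count hapaxes and pure tokens.

    wordlist: dict mapping token -> (ham_count, spam_count).
    This layout is an assumption made for the sketch, not bogofilter's
    on-disk format.
    """
    counts = {"hapax_ham": 0, "hapax_spam": 0, "pure_ham": 0, "pure_spam": 0}
    for ham, spam in wordlist.values():
        if ham > 0 and spam == 0:
            # Seen only in ham, any number of times.
            counts["pure_ham"] += 1
            if ham == 1:
                # Seen exactly once overall, so also a hapax.
                counts["hapax_ham"] += 1
        elif spam > 0 and ham == 0:
            # Seen only in spam, any number of times.
            counts["pure_spam"] += 1
            if spam == 1:
                counts["hapax_spam"] += 1
    return counts


def percent(n, total):
    """Share of all tokens, as a percentage (0.0 for an empty wordlist)."""
    return 100.0 * n / total if total else 0.0


if __name__ == "__main__":
    # Made-up sample data, purely illustrative.
    sample = {
        "meeting": (3, 0),   # pure ham
        "xyzzy": (1, 0),     # ham hapax (and therefore pure ham)
        "viagra": (0, 5),    # pure spam
        "qwerty": (0, 1),    # spam hapax (and therefore pure spam)
        "the": (10, 12),     # mixed, counted in neither category
    }
    c = wordlist_histogram(sample)
    total = len(sample)
    print("hapaxes:  ham %7d (%5.2f%%), spam %7d (%5.2f%%)" % (
        c["hapax_ham"], percent(c["hapax_ham"], total),
        c["hapax_spam"], percent(c["hapax_spam"], total)))
    print("   pure:  ham %7d (%5.2f%%), spam %7d (%5.2f%%)" % (
        c["pure_ham"], percent(c["pure_ham"], total),
        c["pure_spam"], percent(c["pure_spam"], total)))
```

Every hapax is counted again under "pure", which is why the pure figures
can never be smaller than the hapax figures.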



