Wordlist Histogram [was: What did I do wrong? ]
David Relson
relson at osagesoftware.com
Fri Feb 20 01:19:01 CET 2004
On Thu, 19 Feb 2004 19:07:48 -0500
Tom Allison wrote:
> David Relson wrote:
> > On Thu, 19 Feb 2004 14:51:21 +0100
> > Boris 'pi' Piwinger wrote:
> >
> >
> >>David Relson wrote:
> >>
> >>[bogoutil -H]
> >>
> >>>hapaxes: ham 375505 (29.72%), spam 443797 (35.12%)
> >>> pure: ham 562881 (44.55%), spam 616022 (48.75%)
> >>
> >>What is the meaning of pure? Tokens which have been seen
> >>only once for one category, but possibly many times in the
> >>other?
> >
> >
> > hapaxes have a total ham+spam count of 1. "pure" indicates either
> > ham or spam is 0. Given this, all hapaxes are "pure". I'm open to
> > suggestions for better labels :-)
> >
>
> hetero and homo prefixes to something to indicate a mixed (spam and
> ham presence) and singular or pure presence.
>
> I'm curious as to what these values indicate.
> How would I interpret this correctly?
ham hapaxes  - tokens that occurred exactly once and that once was in
               ham.
spam hapaxes - tokens that occurred exactly once and that once was in
               spam.
pure ham     - multiple occurrences of the token, but only in ham.
pure spam    - multiple occurrences of the token, but only in spam.
Together they indicate that my wordlist has many tokens that have only
appeared in a single email and many other tokens that have only been in
ham (or in spam) messages.
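
For illustration only (this is not bogoutil's code), the four categories
could be tallied with a short Python sketch like the one below. It assumes
a hypothetical wordlist dict mapping each token to a (ham count, spam
count) pair. It follows the "either ham or spam is 0" definition of pure,
so every hapax is also counted as pure, and it assumes the percentages are
taken over the total number of tokens in the wordlist. The sample tokens
and counts are made up.

```python
from collections import Counter


def classify(wordlist):
    """Tally hapax and pure tokens.

    wordlist: dict mapping token -> (ham_count, spam_count).
    This layout is an assumption for the sketch, not bogofilter's format.
    """
    stats = Counter()
    for ham, spam in wordlist.values():
        if ham > 0 and spam == 0:
            stats["pure_ham"] += 1          # seen only in ham
            if ham == 1:
                stats["hapax_ham"] += 1     # seen exactly once, in ham
        elif spam > 0 and ham == 0:
            stats["pure_spam"] += 1         # seen only in spam
            if spam == 1:
                stats["hapax_spam"] += 1    # seen exactly once, in spam
    return stats


# Made-up sample data: token -> (ham_count, spam_count)
sample = {
    "meeting": (1, 0),    # ham hapax (and therefore pure ham)
    "viagra": (0, 1),     # spam hapax (and therefore pure spam)
    "patch": (7, 0),      # pure ham
    "mortgage": (0, 12),  # pure spam
    "free": (3, 9),       # mixed: neither hapax nor pure
}

stats = classify(sample)
total = len(sample)  # assumed denominator: all tokens in the wordlist
for label in ("hapax_ham", "hapax_spam", "pure_ham", "pure_spam"):
    print(f"{label}: {stats[label]} ({100 * stats[label] / total:.2f}%)")
```

Under this reading the pure counts can never be smaller than the hapax
counts, which matches the shape of the numbers quoted above.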