Hapax survival over time

David Relson relson at osagesoftware.com
Wed Mar 24 05:26:32 CET 2004


On 23 Mar 2004 23:09:39 -0500
Tom Anderson wrote:

...[snip]...

> I'm not sure a longer period of time is really necessary.  Clearly if
> a token has been seen only once in 20-30 days, it does not play a very
> large role in classifying the vast bulk of your messages.  Therefore,
> it could not possibly hurt to delete it and then score it at robx on
> day 31+.  How strong of an indicator could it be if it is seen so
> infrequently?
> 
> This leads me to propose a different study... how many of those
> hapaxes are outside of your min_dev range?  How many further
> registrations does it take to move them into an influential scoring
> range?

Tom,

Sorry to say, but that study is not very interesting.  A hapax is a
token that has appeared exactly once.  That means its score is roughly
0.0 (if the one occurrence was in ham) or 1.0 (if it was in spam).

At present I have 1,296,178 tokens in wordlist.db.  Of those, 848,752
(about 65%) are hapaxes.  To look at their scores I ran

bogoutil -d wordlist.db | egrep " (0 1|1 0) " | bogoutil -p wordlist.db

The output is:
                                 spam    good    Fisher
$0.0                                1       0  0.994208
$0.024                              0       1  0.004109
$0.044                              0       1  0.004109
$0.049                              1       0  0.994208
$0.05                               0       1  0.004109
$0.075                              1       0  0.994208
$0.080                              1       0  0.994208
$0.14                               0       1  0.004109
$0.18                               1       0  0.994208
$0.185                              1       0  0.994208
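
The two values in the Fisher column can be reproduced with a short
sketch of Robinson's smoothed token score, f(w) = (s*x + n*p(w)) / (s + n).
This is only an illustration, not bogofilter's actual code.  The values
ROBS = 0.01 and ROBX = 0.415 are assumptions, picked because they match
the numbers printed above; they are not stated in this message.  The
function name token_score and its parameters are hypothetical.

```python
# Robinson's smoothed per-token score:
#   f(w) = (s*x + n*p(w)) / (s + n)
# ROBS and ROBX are assumed values, chosen because they reproduce the
# Fisher column shown above; they are not taken from this message.
ROBS = 0.01    # s: weight given to the prior (assumed)
ROBX = 0.415   # x: score used for a token never seen before (assumed)


def token_score(spam_count, good_count, spam_msgs=1, good_msgs=1):
    """Return f(w) for a token with the given spam and good counts."""
    n = spam_count + good_count
    if n == 0:
        return ROBX  # unknown token: only the prior is available
    spam_rate = spam_count / spam_msgs
    good_rate = good_count / good_msgs
    # p(w): the token's raw spam probability from per-message rates
    p = spam_rate / (spam_rate + good_rate)
    return (ROBS * ROBX + n * p) / (ROBS + n)


if __name__ == "__main__":
    print(f"spam-only hapax: {token_score(1, 0):.6f}")
    print(f"ham-only hapax:  {token_score(0, 1):.6f}")
```

Under those assumptions the sketch prints 0.994208 for a spam-only
hapax and 0.004109 for a ham-only one.  For a hapax, p(w) is exactly 0
or 1 whatever the message counts are, so every hapax gets one of these
two scores, and both sit about as far from 0.5 as a score can.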

Enjoy,

David




-- 
David Relson                   Osage Software Systems, Inc.
relson at osagesoftware.com       Ann Arbor, MI 48103
www.osagesoftware.com          tel:  734.821.8800



