Hapax survival over time
David Relson
relson at osagesoftware.com
Wed Mar 24 05:26:32 CET 2004
On 23 Mar 2004 23:09:39 -0500
Tom Anderson wrote:
...[snip]...
> I'm not sure a longer period of time is really necessary. Clearly if
> a token has been seen only once in 20-30 days, it does not play a very
> large role in classifying the vast bulk of your messages. Therefore,
> it could not possibly hurt to delete it and then score it at robx on
> day 31+. How strong of an indicator could it be if it is seen so
> infrequently?
>
> This leads me to propose a different study... how many of those
> hapaxes are outside of your min_dev range? How many further
> registrations does it take to move them into an influential scoring
> range?
Tom,
Sorry to say, but that study is not very interesting. A hapax is a
token that has appeared exactly once. That means its score is roughly
0.0 (if the one occurrence was in ham) or 1.0 (if it was in spam).
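
For illustration, "roughly 0.0 or 1.0" can be made concrete with
Robinson's smoothing formula, f(w) = (s*x + n*p) / (s + n), where n is
the number of times the token was seen and p is its raw spam
probability. The sketch below is Python, not bogofilter source: the
function name is hypothetical, and s = 0.01 (robs) and x = 0.415 (robx)
are assumed values, picked because they reproduce the Fisher column of
the bogoutil output in this message.

```python
def robinson_f(count, p, robs=0.01, robx=0.415):
    """Robinson's smoothed token score: f(w) = (s*x + n*p) / (s + n).

    count -- n, how many times the token has been seen
    p     -- raw spam probability of the token (1.0 = only in spam)
    robs  -- s, weight given to the prior (assumed value)
    robx  -- x, score assumed for an unseen token (assumed value)
    """
    return (robs * robx + count * p) / (robs + count)


# A hapax has count == 1 and was seen either only in spam or only in ham.
spam_hapax = robinson_f(1, 1.0)
ham_hapax = robinson_f(1, 0.0)

print(f"spam hapax: {spam_hapax:.6f}")  # spam hapax: 0.994208
print(f"ham hapax:  {ham_hapax:.6f}")   # ham hapax:  0.004109
```

With a small s, a single registration is already enough to push the
score almost all the way to one end of the range.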
At present I have 1,296,178 tokens in wordlist.db. Of them 848,752 are
hapaxes. To look at their scores I ran:

    bogoutil -d wordlist.db | egrep " (0 1|1 0) " | bogoutil -p wordlist.db

The output is:
          spam  good    Fisher
$0.0         1     0  0.994208
$0.024       0     1  0.004109
$0.044       0     1  0.004109
$0.049       1     0  0.994208
$0.05        0     1  0.004109
$0.075       1     0  0.994208
$0.080       1     0  0.994208
$0.14        0     1  0.004109
$0.18        1     0  0.994208
$0.185       1     0  0.994208
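
The hapax count quoted above could be obtained from a dump along these
lines. This is only a sketch: it assumes each line of "bogoutil -d"
output has the form "token spamcount goodcount date" (which is what the
egrep pattern above implies), and the function name is hypothetical.

```python
def count_hapaxes(dump_lines):
    """Return (total_tokens, hapaxes) for lines of an assumed
    'token spamcount goodcount date' dump."""
    total = 0
    hapaxes = 0
    for line in dump_lines:
        # Split from the right so the last three fields are the counts
        # and the date, whatever characters the token itself contains.
        fields = line.rsplit(None, 3)
        if len(fields) != 4:
            continue  # skip lines that do not match the assumed layout
        _token, spam, good, _date = fields
        try:
            spam, good = int(spam), int(good)
        except ValueError:
            continue  # skip lines whose counts are not integers
        total += 1
        # A hapax has been registered exactly once, in spam or in ham.
        if spam + good == 1:
            hapaxes += 1
    return total, hapaxes


# Made-up sample lines, for illustration only.
sample = [
    "$0.0 1 0 20040323",
    "$0.024 0 1 20040323",
    "viagra 57 0 20040322",
]
total, hapaxes = count_hapaxes(sample)
print(total, hapaxes)  # 3 2
```

With the figures above, 848,752 of 1,296,178 tokens means roughly 65%
of the wordlist is hapaxes.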
Enjoy,
David
>
> Tom
>
>
--
David Relson Osage Software Systems, Inc.
relson at osagesoftware.com Ann Arbor, MI 48103
www.osagesoftware.com tel: 734.821.8800