Hapax survival over time

Tom Anderson tanderso at oac-design.com
Wed Mar 24 06:03:26 CET 2004


On Tue, 2004-03-23 at 23:26, David Relson wrote:
> Sorry to say, but that study is not very interesting.  A hapax is a
> token that has appeared exactly once.  That means its score is roughly
> 0.0 (if the once was in ham) or 1.0 (if it was in spam).

I wouldn't expect a single registration to have such an influence.  This
seems like dangerous behavior, no?  If I had seen "viagra" only once
before, I wouldn't immediately assume it was spam, but that seems to be
what you're saying bogofilter will do.  It should take multiple
instances of a token in spam to make it spammy.  Is your robs value
playing a role here?
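
To make the robs question concrete, here is a minimal sketch, assuming
bogofilter scores a token with Robinson's smoothing
f(w) = (s*x + n*p(w)) / (s + n), where s is robs, x is robx, n is the
number of times the token has been seen, and p(w) is the raw spam
probability.  The values robs = 0.01 and robx = 0.415 are assumptions,
picked only because they reproduce the 0.994208 and 0.004109 figures
quoted below; the function name and the message counts are hypothetical.

```python
def robinson_f(spam_count, good_count, spam_msgs, good_msgs, robs, robx):
    """Robinson's smoothed score f(w) = (s*x + n*p) / (s + n)."""
    n = spam_count + good_count
    if n == 0:
        # A token never seen before falls back to the prior robx.
        return robx
    # Scale raw counts by corpus size before comparing them.
    spam_rate = spam_count / spam_msgs
    good_rate = good_count / good_msgs
    p = spam_rate / (spam_rate + good_rate)
    return (robs * robx + n * p) / (robs + n)


# Hypothetical corpus sizes; for a hapax they cancel out of p.
SPAM_MSGS = GOOD_MSGS = 1000

# Assumed parameters: robs = 0.01, robx = 0.415.
spam_hapax = robinson_f(1, 0, SPAM_MSGS, GOOD_MSGS, robs=0.01, robx=0.415)
ham_hapax = robinson_f(0, 1, SPAM_MSGS, GOOD_MSGS, robs=0.01, robx=0.415)
print(f"{spam_hapax:.6f} {ham_hapax:.6f}")  # 0.994208 0.004109

# A larger robs pulls a single sighting much closer to robx.
damped = robinson_f(1, 0, SPAM_MSGS, GOOD_MSGS, robs=1.0, robx=0.415)
print(f"{damped:.4f}")  # 0.7075
```

Under these assumptions a small robs lets one sighting push the score
almost all the way to 0 or 1, while a robs near 1 keeps a hapax much
closer to the prior.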

> bogoutil -d wordlist.db | egrep " (0 1|1 0) " | bogoutil -p wordlist.db
> 
> The output is:
>                                  spam    good    Fisher
> $0.0                                1       0  0.994208
> $0.024                              0       1  0.004109
> $0.044                              0       1  0.004109
> $0.049                              1       0  0.994208
> $0.05                               0       1  0.004109

Apparently my version outputs something slightly different:

                                 spam    good  Gra prob  Rob/Fis
$0.003 1 0 20031016                 0       0  0.400000  0.000000
$0.011 0 1 20031217                 0       0  0.400000  0.000000
$0.024 0 1 20040323                 0       0  0.400000  0.000000
$0.044 0 1 20040323                 0       0  0.400000  0.000000
$0.045 1 0 20031203                 0       0  0.400000  0.000000
$0.049 0 1 20040323                 0       0  0.400000  0.000000
$0.052 1 0 20031124                 0       0  0.400000  0.000000
$0.055 1 0 20031027                 0       0  0.400000  0.000000

Odd?
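
One possible reading, offered as a guess: the token column above holds
the whole dump line (token, counts, date), so this version of
"bogoutil -p" may be looking up the entire line as a single token and
finding nothing, hence the 0/0 counts.  A minimal sketch, assuming dump
lines look like "token spam good date", that reduces each hapax line to
the bare token before it is passed on.  hapax_tokens is a hypothetical
helper, and the last sample line is invented for illustration.

```python
def hapax_tokens(dump_lines):
    """Yield tokens whose (spam, good) counts are (1, 0) or (0, 1)."""
    for line in dump_lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        token, spam, good = fields[0], fields[1], fields[2]
        if (spam, good) in {("1", "0"), ("0", "1")}:
            yield token


sample = [
    "$0.003 1 0 20031016",
    "$0.011 0 1 20031217",
    "viagra 12 0 20040323",  # invented non-hapax line
]
print(list(hapax_tokens(sample)))  # ['$0.003', '$0.011']
```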

Tom
