Hapax survival over time
Tom Anderson
tanderso at oac-design.com
Wed Mar 24 06:03:26 CET 2004
On Tue, 2004-03-23 at 23:26, David Relson wrote:
> Sorry to say, but that study is not very interesting. A hapax is a token
> that has appeared exactly once. That means its score is roughly 0.0 (if
> that one occurrence was in ham) or 1.0 (if it was in spam).
I wouldn't expect a single registration to have that much influence. This
seems like dangerous behavior, no? If I had seen "viagra" only once
before, I wouldn't immediately assume it was spam, but that seems to be
what you're saying bogofilter will do. It should take multiple instances
of a token in spam to make it spammy. Is your robs value playing a role
here?
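
To make the robs question concrete, here is a minimal sketch of Robinson's
smoothing formula, f(w) = (s*x + n*p(w)) / (s + n), where s is robs, x is
robx, n is the number of times the token has been seen, and p(w) is the raw
spam fraction. The values robs=0.01 and robx=0.415 are assumptions, not taken
from anyone's actual configuration, although they do happen to reproduce the
0.994208 and 0.004109 figures in the quoted output. The function name is
hypothetical, and p(w) is simplified by assuming equally sized spam and ham
corpora.

```python
def robinson_f(spam_count, good_count, robs=0.01, robx=0.415):
    """Smoothed token score f(w) = (s*x + n*p) / (s + n).

    robs (s) and robx (x) are assumed values, not read from any config.
    """
    n = spam_count + good_count
    if n == 0:
        # A token never seen before falls back to the prior x.
        return robx
    # Simplification: equal corpus sizes, so p(w) is just the count ratio.
    p = spam_count / n
    return (robs * robx + n * p) / (robs + n)


# A small robs lets one sighting dominate; a larger one pulls toward robx.
for s in (0.01, 1.0):
    spam_hapax = robinson_f(1, 0, robs=s)
    ham_hapax = robinson_f(0, 1, robs=s)
    print(f"robs={s}: spam hapax {spam_hapax:.6f}, ham hapax {ham_hapax:.6f}")
```

With robs=0.01 a single sighting pushes the score almost all the way to 0 or
1. With robs=1.0 the same hapax lands at about 0.71 or 0.21, so under this
formula the strength of a hapax is governed largely by robs.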
> bogoutil -d wordlist.db | egrep " (0 1|1 0) " | bogoutil -p wordlist.db
>
> The output is:
>            spam   good   Fisher
> $0.0          1      0   0.994208
> $0.024        0      1   0.004109
> $0.044        0      1   0.004109
> $0.049        1      0   0.994208
> $0.05         0      1   0.004109
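
The egrep stage in the quoted pipeline keeps only tokens whose counts are
"1 0" or "0 1". A rough Python sketch of the same selection follows, assuming
each dump line has the form "token spam good [date]". That layout is an
assumption here, and the helper name and the third sample line are
hypothetical.

```python
def is_hapax(line):
    """Return True if a 'token spam good [date]' line has counts (1,0) or (0,1)."""
    fields = line.split()
    if len(fields) < 3:
        return False
    try:
        spam, good = int(fields[1]), int(fields[2])
    except ValueError:
        # Skip header lines or anything that does not carry integer counts.
        return False
    return (spam, good) in {(1, 0), (0, 1)}


sample = [
    "$0.003 1 0 20031016",
    "$0.011 0 1 20031217",
    "viagra 12 0 20040323",  # hypothetical non-hapax entry
]
hapaxes = [line for line in sample if is_hapax(line)]
for line in hapaxes:
    print(line)
```

Checking the count fields by position avoids the regex accidentally matching
a token that itself contains " 0 1 " or " 1 0 ".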
Apparently my version outputs something slightly different:
spam good Gra prob Rob/Fis
$0.003 1 0 20031016 0 0 0.400000 0.000000
$0.011 0 1 20031217 0 0 0.400000 0.000000
$0.024 0 1 20040323 0 0 0.400000 0.000000
$0.044 0 1 20040323 0 0 0.400000 0.000000
$0.045 1 0 20031203 0 0 0.400000 0.000000
$0.049 0 1 20040323 0 0 0.400000 0.000000
$0.052 1 0 20031124 0 0 0.400000 0.000000
$0.055 1 0 20031027 0 0 0.400000 0.000000
Odd?
Tom