Hapax survival over time
David Relson
relson at osagesoftware.com
Wed Mar 24 13:31:00 CET 2004
On 24 Mar 2004 00:03:26 -0500
Tom Anderson wrote:
> On Tue, 2004-03-23 at 23:26, David Relson wrote:
> > Sorry to say, but that study is not very interesting. A hapax is a
> > token that has appeared exactly once. That means its score is roughly
> > 0.0 (if the one occurrence was in ham) or 1.0 (if it was in spam).
>
> I wouldn't expect just one registration to have such an influence.
> This seems like dangerous behavior, no? If I had only seen "viagra"
> once before, I wouldn't assume immediately that it was spam, but this
> seems to be what you're saying bogofilter will do. It should take
> multiple instances of a token in spams to make them spammy. Is your
> robs value playing a role here?
Hi Tom,
Of course a single occurrence of a token has a big influence. In the
simplest case, consider a wordlist built from 2 messages - 1 each of ham
and spam. Every token will be in 1 of 3 states - pure ham, pure spam,
or 50-50. Scoring a new message will give 4 token values - ham, spam,
50-50, and unknown.
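
As a rough illustration of those four cases, here is a minimal sketch of Robinson's smoothed token score, f(w) = (s*x + n*p) / (s + n). The function name token_score is made up for this example, and s = 0.01 (robs) and x = 0.415 (robx) are assumed values, picked because they reproduce the 0.994208 and 0.004109 figures in the output quoted below; they are not read from any bogofilter configuration.

```shell
#!/bin/sh
# Sketch of Robinson's smoothed token score: f(w) = (s*x + n*p) / (s + n).
# robs (s) and robx (x) are ASSUMED values, not taken from a real config.
token_score() {
    # $1 = spam count of token, $2 = ham count of token,
    # $3 = number of spam messages, $4 = number of ham messages
    awk -v bad="$1" -v good="$2" -v nbad="$3" -v ngood="$4" 'BEGIN {
        robs = 0.01; robx = 0.415          # assumed smoothing parameters
        n = bad + good                     # total sightings of the token
        if (n == 0) {                      # unknown token: fall back to robx
            printf "%.6f\n", robx
            exit
        }
        pb = bad / nbad                    # relative frequency in spam
        pg = good / ngood                  # relative frequency in ham
        p = pb / (pb + pg)                 # raw spamminess estimate
        printf "%.6f\n", (robs * robx + n * p) / (robs + n)
    }'
}

# Wordlist built from 1 spam and 1 ham message:
token_score 1 0 1 1    # pure spam
token_score 0 1 1 1    # pure ham
token_score 1 1 1 1    # 50-50
token_score 0 0 1 1    # unknown
```

With so small an s, one sighting is enough to push the score almost all the way to 0 or 1; a larger s would pull a hapax back toward x.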
> > bogoutil -d wordlist.db | egrep " (0 1|1 0) " | bogoutil -p wordlist.db
> >
> > The output is:
> >         spam  good  Fisher
> > $0.0       1     0  0.994208
> > $0.024     0     1  0.004109
> > $0.044     0     1  0.004109
> > $0.049     1     0  0.994208
> > $0.05      0     1  0.004109
>
> Apparently my version outputs something slightly different:
>
>                      spam  good  Gra prob   Rob/Fis
> $0.003 1 0 20031016     0     0  0.400000  0.000000
> $0.011 0 1 20031217     0     0  0.400000  0.000000
> $0.024 0 1 20040323     0     0  0.400000  0.000000
> $0.044 0 1 20040323     0     0  0.400000  0.000000
> $0.045 1 0 20031203     0     0  0.400000  0.000000
> $0.049 0 1 20040323     0     0  0.400000  0.000000
> $0.052 1 0 20031124     0     0  0.400000  0.000000
> $0.055 1 0 20031027     0     0  0.400000  0.000000
>
> Odd?
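Not really. I transcribed the command incorrectly and left out the awk
step that strips each dump line down to its token. Without it,
bogoutil -p looks up the whole line as if it were a token, finds
nothing, and reports zero counts. It should be: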
bogoutil -d wordlist.db | \
egrep " (0 1|1 0) " | \
awk '{print $1}' | \
bogoutil -p wordlist.db
(Feel free to remove the backslashes and make it one long command).
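
To see what the egrep and awk stages do without a real wordlist, here is a small sketch that runs them on made-up dump lines. It assumes the "token spam good date" layout visible in the quoted output; sample_dump, hapax_tokens, and the "viagra" and "meeting" rows are hypothetical, and grep -E stands in for egrep.

```shell
#!/bin/sh
# Hypothetical dump lines in the assumed "token spam good date" layout.
sample_dump() {
    printf '%s\n' \
        '$0.003 1 0 20031016' \
        '$0.011 0 1 20031217' \
        'viagra 12 0 20040101' \
        'meeting 3 7 20040210'
}

# Keep only lines whose counts are exactly "1 0" or "0 1" (hapaxes),
# then print just the first field so only the token is passed along.
hapax_tokens() {
    grep -E ' (0 1|1 0) ' | awk '{print $1}'
}

sample_dump | hapax_tokens
```

Only the two hapax tokens come out the other end, one per line, which is the form bogoutil -p is being fed in the corrected command.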