Hapax survival over time

David Relson relson at osagesoftware.com
Wed Mar 24 13:31:00 CET 2004


On 24 Mar 2004 00:03:26 -0500
Tom Anderson wrote:

> On Tue, 2004-03-23 at 23:26, David Relson wrote:
> > Sorry to say, but that study is not very interesting.  A hapax is a
> > token that has appeared exactly once.  That means its score is
> > roughly 0.0 (if that one occurrence was in ham) or 1.0 (if it was in
> > spam).
> 
> I wouldn't expect just one registration to have such an influence. 
> This seems like dangerous behavior, no?  If I had only seen "viagra"
> once before, I wouldn't assume immediately that it was spam, but this
> seems to be what you're saying bogofilter will do.  It should take
> multiple instances of a token in spams to make it spammy.  Is your
> robs value playing a role here?

Hi Tom,

Of course a single occurrence of a token has a big influence.  In the
simplest case, consider a wordlist built from 2 messages - 1 each of ham
and spam.  Every token will be in 1 of 3 states - pure ham, pure spam,
or 50-50 (seen once in each).  Scoring a new message can then give 4
kinds of token value - ham, spam, 50-50, and unknown (a token that was
in neither message).
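
To put rough numbers on that, here is a minimal Python sketch of
Robinson's smoothing formula, f(w) = (s*x + n*p) / (s + n).  It is not
bogofilter's actual code.  The values s = 0.01 (robs) and x = 0.415
(robx) are assumptions, picked because they reproduce the Fisher column
in the bogoutil output quoted further down; your configuration may
differ.  The function name is made up for illustration.

```python
# Hypothetical illustration of Robinson's f(w); not taken from bogofilter's source.
ROBS = 0.01   # assumed strength of the prior (s)
ROBX = 0.415  # assumed score for a never-seen token (x)


def token_score(spam_count, ham_count, spam_msgs=1, ham_msgs=1):
    """Return f(w) = (s*x + n*p) / (s + n) for one token."""
    n = spam_count + ham_count
    if n == 0:
        return ROBX  # unknown token: only the prior is left
    spam_rate = spam_count / spam_msgs
    ham_rate = ham_count / ham_msgs
    p = spam_rate / (spam_rate + ham_rate)  # raw spam probability
    return (ROBS * ROBX + n * p) / (ROBS + n)


# Wordlist built from 1 ham and 1 spam message, as in the example above.
for label, spam, ham in [("pure spam", 1, 0), ("pure ham", 0, 1),
                         ("50-50", 1, 1), ("unknown", 0, 0)]:
    print(f"{label:10s} {token_score(spam, ham):.6f}")
```

With these assumed values a lone spam occurrence scores about 0.994 and
a lone ham occurrence about 0.004.  A larger robs would pull hapaxes
back toward robx.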

> > bogoutil -d wordlist.db | egrep " (0 1|1 0) " | bogoutil -p wordlist.db
> > 
> > The output is:
> >                                  spam    good    Fisher
> > $0.0                                1       0  0.994208
> > $0.024                              0       1  0.004109
> > $0.044                              0       1  0.004109
> > $0.049                              1       0  0.994208
> > $0.05                               0       1  0.004109
> 
> Apparently my version outputs something slightly different:
> 
>                                  spam    good  Gra prob  Rob/Fis
> $0.003 1 0 20031016                 0       0  0.400000  0.000000
> $0.011 0 1 20031217                 0       0  0.400000  0.000000
> $0.024 0 1 20040323                 0       0  0.400000  0.000000
> $0.044 0 1 20040323                 0       0  0.400000  0.000000
> $0.045 1 0 20031203                 0       0  0.400000  0.000000
> $0.049 0 1 20040323                 0       0  0.400000  0.000000
> $0.052 1 0 20031124                 0       0  0.400000  0.000000
> $0.055 1 0 20031027                 0       0  0.400000  0.000000
> 
> Odd?

Not really.  I transcribed the command incorrectly and left out the awk
step that strips each dump line down to just the token.  It should be:

   bogoutil -d wordlist.db | \
   egrep " (0 1|1 0) "     | \
   awk '{print $1}'        | \
   bogoutil -p wordlist.db

(Feel free to remove the backslashes and make it one long command).
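
For anyone unsure what the two middle stages do, here is a rough Python
equivalent.  It assumes dump lines look like "token spamcount goodcount
date", which is inferred from Tom's output above.  The sample lines and
the function name are invented for illustration.

```python
import re

# Same pattern as the egrep stage: counts of exactly "0 1" or "1 0",
# with a space on either side.
HAPAX = re.compile(r" (0 1|1 0) ")


def hapax_tokens(dump_lines):
    """Keep hapax lines, then return only the first field (the token)."""
    tokens = []
    for line in dump_lines:
        if HAPAX.search(line):              # egrep " (0 1|1 0) "
            tokens.append(line.split()[0])  # awk '{print $1}'
    return tokens


# Made-up dump lines in the assumed "token spam good date" layout.
sample = [
    "$0.024 0 1 20040323",
    "viagra 12 0 20040301",
    "$0.045 1 0 20031203",
    "hello 1 1 20040110",
]
print("\n".join(hapax_tokens(sample)))
```

Without the awk stage, bogoutil -p is handed the whole line as if it
were a token, finds nothing by that name, and reports zero counts, which
is what Tom saw.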



