Hapax survival over time

Tom Anderson tanderso at oac-design.com
Wed Mar 24 15:09:09 CET 2004


On Wed, 2004-03-24 at 07:31, David Relson wrote:
> Of course a single occurrence of a token has a big influence.  In the
> simplest case, consider a wordlist built from 2 messages - 1 each of ham
> and spam.  Every token will be in 1 of 3 states - pure ham, pure spam,
> or 50-50.  Scoring a new message will give 4 token values - ham, spam,
> 50-50, and unknown.

I don't agree with this logic.  If I'm learning a new language such as
Spanish, and I encounter a new term, say "por", which I look up
and understand in one context, that doesn't mean I'm confident about what
it means in all possible contexts.  In this case, "por" could mean
"through", "for", "along", or "by".  I use this comparison because
humans and other animals learn through a Bayesian process
(yep, I remember my Philosophy courses), so bogofilter ought to learn
in a similar way.

Following the above argument, when bogofilter first "looks up" a new
word, it should not have high confidence in how that word will be used
in future instances.  Therefore, the word's score should not move too far from robx.
I would want at least several registrations before it moves out of my
min_dev range and starts affecting classifications.
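
To make that concrete, here is a rough sketch of Robinson's smoothing,
f(w) = (robs * robx + n * p(w)) / (robs + n), applied to a token seen
once in spam and never in ham.  The function names, the message counts,
and the robx, robs and min_dev values are all made up for illustration;
they are not bogofilter's code or its defaults, and I am assuming that
min_dev is measured as distance from 0.5.

```python
# Illustrative sketch only: names and parameter values are assumptions,
# not bogofilter's actual implementation or defaults.

def raw_spam_prob(spam_count, good_count, spam_msgs, good_msgs):
    """p(w): the token's spam frequency relative to its total frequency."""
    spam_freq = spam_count / spam_msgs
    good_freq = good_count / good_msgs
    return spam_freq / (spam_freq + good_freq)


def robinson_f(spam_count, good_count, spam_msgs, good_msgs, robx, robs):
    """f(w) = (robs * robx + n * p(w)) / (robs + n), n = total registrations."""
    n = spam_count + good_count
    if n == 0:
        return robx  # a never-seen token sits exactly at robx
    p = raw_spam_prob(spam_count, good_count, spam_msgs, good_msgs)
    return (robs * robx + n * p) / (robs + n)


def affects_score(f, min_dev):
    """Assumed rule: a token counts only if it is at least min_dev from 0.5."""
    return abs(f - 0.5) >= min_dev


ROBX, MIN_DEV = 0.415, 0.1  # assumed values for the example

for robs in (0.01, 5.0):
    for n in (1, 2, 3):
        # token registered n times in spam, never in ham, 1000 messages each
        f = robinson_f(n, 0, 1000, 1000, ROBX, robs)
        print(f"robs={robs:<5} n={n}  f(w)={f:.4f}  "
              f"counts={affects_score(f, MIN_DEV)}")
```

With a tiny robs, one registration already pushes f(w) almost to 1.0.
With a larger robs (5.0 in this sketch) the same token stays inside the
min_dev band until it has been registered a few times, which is the
behaviour I am arguing for.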

> Not really.  I transcribed the command wrongly.  It should be:
>  
>    bogoutil -d wordlist.db | \
>    egrep " (0 1|1 0) "     | \
>    awk '{print $1}'        | \
>    bogoutil -p wordlist.db

[tanderso at www .bogofilter]$ bogoutil -d wordlist.db | egrep " (0 1|1 0) " | awk '{print $1}' | bogoutil -p wordlist.db

                                 spam    good  Gra prob  Rob/Fis
$0.080                              0       1  0.400000  0.000000
$0.185                              0       1  0.400000  0.000000
$0.21                               1       0  0.400000  0.000000
$0.27                               0       1  0.400000  0.000000
$0.28                               0       1  0.400000  0.000000
$0.51                               1       0  0.400000  0.000000
$0.52                               0       1  0.400000  0.000000
$0.56                               1       0  0.400000  0.000000
...

I still don't see a Fisher probability, and I don't imagine 0.4 is
correct for Graham either.

Tom
