divergence?

Tom Anderson tanderso at oac-design.com
Mon May 17 14:29:42 CEST 2004


From: "Tom Allison" <tallison at tacocat.net>
> I'm looking for ideas on how to manage this better.  I'm afraid that if
> I just run retraining loops every single day I'll eventually end up with
> a wordlist that has a gross divergence from reality.

I do my corrections with bfproxy using the exhaustive setting.  Most of my
unsures and false negatives train in one or two repetitions, but some of
them can take 10 or more.  Here are a few examples:

subject: Email traffic
original spamicity: 0.000194
user classification: spam
command: bogofilter -Ns
words: 617
new spamicity: 0.181389
new spamicity: 0.299991
new spamicity: 0.368289
new spamicity: 0.409813
new spamicity: 0.436418
new spamicity: 0.459002
new spamicity: 0.470369

subject: Re: Excel file
original spamicity: 0.026707
user classification: spam
command: bogofilter -Ns
words: 52
new spamicity: 0.017475
new spamicity: 0.060528
new spamicity: 0.107725
new spamicity: 0.151361
new spamicity: 0.189718
new spamicity: 0.222898
new spamicity: 0.251497
new spamicity: 0.276189
new spamicity: 0.297587
new spamicity: 0.316219
new spamicity: 0.332522

subject: Bigger is Better - It is all natural
original spamicity: 0.011621
user classification: spam
command: bogofilter -Ns
words: 124
new spamicity: 0.138941
new spamicity: 0.185118
new spamicity: 0.218314
new spamicity: 0.244444
new spamicity: 0.266013
new spamicity: 0.284358
new spamicity: 0.300287
new spamicity: 0.314330
new spamicity: 0.326855
new spamicity: 0.338132
new spamicity: 0.348363
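
For anyone who wants to see the shape of that loop, here is a minimal
sketch in Python.  It is not bfproxy's code.  The names train_until and
ToyFilter, the 0.45 target, and the 15% step are all made up for
illustration; in real use, train() would run something like
"bogofilter -Ns" on the message and score() would re-classify it.

```python
from typing import Callable, List


def train_until(score: Callable[[], float],
                train: Callable[[], None],
                target: float,
                max_reps: int = 25) -> List[float]:
    """Repeat train() until score() reaches target or max_reps is hit.

    Returns the score observed after each repetition.
    """
    history: List[float] = []
    for _ in range(max_reps):
        train()
        current = score()
        history.append(current)
        if current >= target:
            break
    return history


class ToyFilter:
    """Hypothetical stand-in for the real filter.

    Each spam registration closes 15% of the remaining gap to 1.0.
    This only imitates the diminishing steps seen in the logs above;
    it is not how bogofilter computes spamicity.
    """

    def __init__(self, spamicity: float) -> None:
        self.spamicity = spamicity

    def train_spam(self) -> None:
        self.spamicity += 0.15 * (1.0 - self.spamicity)

    def score(self) -> float:
        return self.spamicity


toy = ToyFilter(0.011621)
history = train_until(toy.score, toy.train_spam, target=0.45)
for value in history:
    print(f"new spamicity: {value:.6f}")
```

The max_reps cap is there so that a message which never reaches the
target cannot keep the loop running, and keep skewing the wordlist,
forever.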

Repeating the training pushes toward neutral any strongly hammy tokens that
nonetheless show up in spam.  It shouldn't have a large effect on your hams,
since they should contain other hammy tokens that appear in spam rarely or
not at all.  I've found accuracy much improved doing this.  I would still
suggest a conservative configuration, though, because if you bias toward
spam you could conceivably get false positives.
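
To illustrate how one hammy token drifts toward neutral, here is a rough
sketch for a single token.  It uses a Robinson-style smoothed estimate,
(s*x + n*p) / (s + n), and makes several simplifying assumptions: the
spam and ham message totals are treated as equal, s = 1.0 and x = 0.5 are
picked for illustration, and each "-Ns" repetition is assumed to remove
one ham count and add one spam count.  The starting counts are invented.

```python
def token_spamicity(spam_count: int, ham_count: int,
                    s: float = 1.0, x: float = 0.5) -> float:
    """Robinson-style smoothed estimate for one token.

    Assumes equal spam and ham message totals, so the raw estimate p is
    simply spam_count / (spam_count + ham_count).
    """
    n = spam_count + ham_count
    p = spam_count / n if n else x
    return (s * x + n * p) / (s + n)


# Hypothetical token: seen in 10 hams and no spams before correction.
spam_count, ham_count = 0, 10
values = []
for rep in range(1, 9):
    # Assumed effect of one "-Ns" repetition on this token's counts.
    spam_count += 1
    ham_count = max(0, ham_count - 1)
    values.append(token_spamicity(spam_count, ham_count))
    print(f"repetition {rep}: token spamicity {values[-1]:.3f}")
```

With these invented counts, the token reaches the neutral value of 0.5
only at the fifth repetition, which is why a message dominated by such
tokens needs many passes.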

Tom



