Clean the database from non-spam mails?

David Relson relson at osagesoftware.com
Wed Dec 3 00:58:44 CET 2003


On Tue, 2 Dec 2003 10:56:53 -0800
Chris Wilkes <cwilkes-bf at ladro.com> wrote:

> On Tue, Dec 02, 2003 at 07:14:10PM +0100, Johannes Klug wrote:
> > 
> > I'd like to remove all non-spam emails from my database. I
> > trained bogofilter only with about 200 ham emails, now my ham-box
> > is about 700.
> 
> I did a little experiment with removing all words whose ham counts
> were higher than their spam counts, by filtering the output of
> bogoutil -d, which is:
>   1  word
>   2  spam count
>   3  ham  count
>   4  date updated
> 
>   # bogoutil -d ./wordlist.db | awk '$2 > $3 {print}' | \
>       bogoutil -l ./new.db
> 
> However this causes all my emails that registered with a spamicity of
> 1.0000 to fall to 0.41.  Just looking for one obvious spam word
> 'enlarge' in an email:
> 
>   # bogoutil -d /tmp/orig/wordlist.db |  \
>     awk '$2 > $3 {print}' | bogoutil -l /tmp/onlyspam/wordlist.db
>   
>   # bogoutil -w /tmp/orig/wordlist.db enlarge
>                                  spam   good
>                      enlarge       87      0
> 
>   # bogoutil -w /tmp/onlyspam/wordlist.db enlarge
>                                  spam   good
>                      enlarge       87      0
> 
> An email with 'enlarge' in it ($s = email message file):
> 
>   # bogofilter -vvv -d /tmp/orig/     -I $s | grep enlarge
>     "enlarge"         87  0.000000  0.021712  0.999933 +
> 
>   # bogofilter -vvv -d /tmp/onlyspam/ -I $s | grep enlarge
>     "enlarge"          0  0.000000  0.000000  0.415000 -
> 
> Doesn't seem to pick it up now.  Did I screw up something with
> creating the new wordlist.db file?  The spam score of that email went
> from 1.0000 to 0.415000.  I'm running version 0.15.9.
> 
> Chris

Hi Chris,

Not being really good with awk, I tried your command and found it to
work fine for me.  To see more about what it does, I ran

   bogoutil -d /tmp/orig/wordlist.db | tee 1.tmp | \
   awk '$2 > $3 {print}' | tee 2.tmp | \
   bogoutil -l /tmp/onlyspam/wordlist.db

and then ran "gtkdiff 1.tmp 2.tmp".
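
Without gtkdiff, much the same before/after comparison can be sketched
with awk alone.  The sample words, counts, and dates below are invented
for illustration; they only assume the four-column "bogoutil -d" layout
Chris lists above (word, spam count, ham count, date updated).  The file
dropped.tmp is a hypothetical name for this sketch.

```shell
# Hypothetical sample standing in for the output of "bogoutil -d"
# (columns: word, spam count, ham count, date updated).
cat > 1.tmp <<'EOF'
enlarge 87 0 20031202
meeting 1 42 20031202
free 30 30 20031202
EOF

# Same filter as in the pipeline: keep only words whose spam count
# is strictly greater than their ham count.
awk '$2 > $3 {print}' 1.tmp > 2.tmp

# The complement: words the filter drops (ham count >= spam count).
awk '$2 <= $3 {print $1}' 1.tmp > dropped.tmp

echo "kept:";    cat 2.tmp
echo "dropped:"; cat dropped.tmp
```

Note that a word with equal counts, like "free" in the sample, is
dropped too, since the test is a strict "greater than".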

From the zero counts and the 0.415000 value (the score bogofilter falls
back to for a word it has no counts for), it appears that "enlarge" is
not in /tmp/onlyspam/wordlist.db.  Use "bogoutil -d
/tmp/onlyspam/wordlist.db | grep enlarge" to see if it's there or not.
To check the whole operation, I'd do something like:

    bogoutil -d /tmp/orig/wordlist.db | tee orig.tmp | wc -l
    bogoutil -d /tmp/onlyspam/wordlist.db | tee spam.tmp | wc -l
    diff orig.tmp spam.tmp
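
As a rough sketch of how the results of that check might be read, here
is the same comparison run on two small made-up dump files.  The
contents of orig.tmp and spam.tmp are invented for illustration; real
ones would come from the "bogoutil -d" commands above.

```shell
# Hypothetical sample dumps (word, spam count, ham count, date updated).
cat > orig.tmp <<'EOF'
enlarge 87 0 20031202
free 30 30 20031202
meeting 1 42 20031202
EOF
cat > spam.tmp <<'EOF'
enlarge 87 0 20031202
EOF

# Line counts before and after filtering.
orig_count=$(wc -l < orig.tmp | tr -d ' ')
spam_count=$(wc -l < spam.tmp | tr -d ' ')
echo "original: $orig_count words, filtered: $spam_count words"

# diff marks lines found only in orig.tmp with "< ", so these are
# the words that did not make it into the filtered list.
removed=$(diff orig.tmp spam.tmp | grep '^< ' | cut -d' ' -f2)
echo "removed:"
echo "$removed"

# Check a single word by exact match on the first column, which avoids
# grep also matching longer words that merely contain it.
word=enlarge
if awk -v w="$word" '$1 == w {found = 1} END {exit !found}' spam.tmp; then
    status=present
else
    status=missing
fi
echo "$word is $status in the filtered list"
```

If the word shows up as missing here but has a higher spam count than
ham count in the original dump, the problem is in how the new wordlist
was built or loaded rather than in the awk filter.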

David



