Clean the database from non-spam mails?

Chris Wilkes cwilkes-bf at ladro.com
Wed Dec 3 01:12:36 CET 2003


On Tue, Dec 02, 2003 at 06:58:44PM -0500, David Relson wrote:
> On Tue, 2 Dec 2003 10:56:53 -0800
> Chris Wilkes <cwilkes-bf at ladro.com> wrote:
> 
> >   # bogoutil -d /tmp/orig/wordlist.db |  \
> >     awk '$2 > $3 {print}' | bogoutil -l /tmp/onlyspam/wordlist.db
> >   
> >   # bogoutil -w /tmp/orig/wordlist.db enlarge
> >                                  spam   good
> >                      enlarge       87      0
> > 
> >   # bogoutil -w /tmp/onlyspam/wordlist.db enlarge
> >                                  spam   good
> >                      enlarge       87      0
> > 
> > An email with 'enlarge' in it ($s = email message file):
> > 
> >   # bogofilter -vvv -d /tmp/orig/     -I $s | grep enlarge
> >     "enlarge"         87  0.000000  0.021712  0.999933 +
> > 
> >   # bogofilter -vvv -d /tmp/onlyspam/ -I $s | grep enlarge
> >     "enlarge"          0  0.000000  0.000000  0.415000 -
> 
> From the zero counts and the 0.415000 value it appears that "enlarge" is
> not in /tmp/onlyspam/wordlist.db.  Use "bogoutil -d
> /tmp/onlyspam/wordlist.db | grep enlarge" to see if it's there or not. 

I quadruple-checked that -- I did a "-w ... enlarge" lookup to see if it
was in there (see above) and it is, and I also did a grep for it and it
shows up.

  # bogoutil -d /tmp/onlyspam/wordlist.db | grep enlarge
    enlarge 87 0 20031129
    enlargeable 2 0 20031116
    enlargement 95 0 20031129
    subj:enlarge 3 0 20031009
    subj:enlargement 13 0 20031127

  # bogoutil -d /tmp/orig/wordlist.db | grep enlarge
    enlarge 87 0 20031129
    enlargeable 2 0 20031116
    enlargement 95 0 20031129
    subj:enlarge 3 0 20031009
    subj:enlargement 13 0 20031127

The "enlarge" in my email message is in the body, not in the subject
line.
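
Since grep matches substrings (hence the enlargeable and enlargement
lines above), an exact match on the first field narrows a dump to the one
token in question. A minimal sketch, assuming the four-column
"token spam good date" dump format shown above; exact_token is a
hypothetical helper name, and the sample lines are copied from the output
above rather than read from a database.

```shell
# Hypothetical helper: keep only dump lines whose first field equals $1 exactly.
exact_token() {
    awk -v tok="$1" '$1 == tok'
}

# Against a real database (path from this thread) it would be used as:
#   bogoutil -d /tmp/onlyspam/wordlist.db | exact_token enlarge

# Stand-in input so the sketch runs without bogoutil installed.
printf '%s\n' \
    'enlarge 87 0 20031129' \
    'enlargeable 2 0 20031116' \
    'subj:enlarge 3 0 20031009' | exact_token enlarge
# prints: enlarge 87 0 20031129
```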

> To check the whole operation, I'd do something like:
> 
>     bogoutil -d /tmp/orig/wordlist.db | tee orig.tmp | wc -l
>     bogoutil -d /tmp/onlyspam/wordlist.db | tee spam.tmp | wc -l
>     diff orig.tmp spam.tmp

The difference is pretty big: 81k words in the original list versus 58k
in the new one.  But all I care about is the "enlarge" token, right?
I'm wondering if by throwing out certain tokens with my awk script
I somehow corrupted something that bogofilter is expecting to find in
the database.  It's just a list of word tokens, right?  I.e., if I throw
out "magiccookieyouneedthistorun 4 10 20031110" it's not going to affect
anything.
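
A rough way to see exactly which entries the awk filter discards is to
invert its condition and look at what comes out. This is only a sketch,
assuming the "token spam good date" dump format shown earlier;
dropped_tokens is a made-up helper name and the sample lines are
stand-ins, not real database contents.

```shell
# Hypothetical helper: print the dump lines that  awk '$2 > $3'  would
# throw away, i.e. entries whose spam count is not greater than the good count.
dropped_tokens() {
    awk '!($2 > $3)'
}

# Against a real database (path from this thread) it would be used as:
#   bogoutil -d /tmp/orig/wordlist.db | dropped_tokens | less

# Stand-in input so the sketch runs without bogoutil installed.
printf '%s\n' \
    'enlarge 87 0 20031129' \
    'magiccookieyouneedthistorun 4 10 20031110' \
    'hello 5 5 20031101' | dropped_tokens
```

Scanning that output for anything that does not look like an ordinary
word would show whether the filter removed more than plain tokens.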

Chris



