Clean the database from non-spam mails?
Chris Wilkes
cwilkes-bf at ladro.com
Wed Dec 3 01:12:36 CET 2003
On Tue, Dec 02, 2003 at 06:58:44PM -0500, David Relson wrote:
> On Tue, 2 Dec 2003 10:56:53 -0800
> Chris Wilkes <cwilkes-bf at ladro.com> wrote:
>
> > # bogoutil -d /tmp/orig/wordlist.db | \
> > awk '$2 > $3 {print}' | bogoutil -l /tmp/onlyspam/wordlist.db
> >
> > # bogoutil -w /tmp/orig/wordlist.db enlarge
> > spam good
> > enlarge 87 0
> >
> > # bogoutil -w /tmp/onlyspam/wordlist.db enlarge
> > spam good
> > enlarge 87 0
> >
> > An email with 'enlarge' in it ($s = email message file):
> >
> > # bogofilter -vvv -d /tmp/orig/ -I $s | grep enlarge
> > "enlarge" 87 0.000000 0.021712 0.999933 +
> >
> > # bogofilter -vvv -d /tmp/onlyspam/ -I $s | grep enlarge
> > "enlarge" 0 0.000000 0.000000 0.415000 -
>
> From the zero counts and the 0.415000 value it appears that "enlarge" is
> not in /tmp/onlyspam/wordlist.db. Use "bogoutil -d
> /tmp/onlyspam/wordlist.db | grep enlarge" to see if it's there or not.
I quadruple-checked that. I ran "-w ... enlarge" to see whether it was
in there (see above), and it is. I also grepped for it, and it shows
up:
# bogoutil -d /tmp/onlyspam/wordlist.db | grep enlarge
enlarge 87 0 20031129
enlargeable 2 0 20031116
enlargement 95 0 20031129
subj:enlarge 3 0 20031009
subj:enlargement 13 0 20031127
# bogoutil -d /tmp/orig/wordlist.db | grep enlarge
enlarge 87 0 20031129
enlargeable 2 0 20031116
enlargement 95 0 20031129
subj:enlarge 3 0 20031009
subj:enlargement 13 0 20031127
The "enlarge" in my email message is in the body, not in the subject
line.
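Because a plain grep also matches the longer tokens and the subj: variants, an exact match on the first field isolates the single body token. A minimal sketch, using lines copied from the dump output above (the `dump` and `exact` variables are only stand-ins for real `bogoutil -d` output):

```shell
# Sample lines copied from the dump shown above; in practice they would
# come from: bogoutil -d /tmp/onlyspam/wordlist.db
dump='enlarge 87 0 20031129
enlargeable 2 0 20031116
enlargement 95 0 20031129
subj:enlarge 3 0 20031009
subj:enlargement 13 0 20031127'

# Compare the token field exactly instead of matching any line that
# merely contains the string "enlarge".
exact=$(printf '%s\n' "$dump" | awk '$1 == "enlarge"')
echo "$exact"
```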
> To check the whole operation, I'd do something like:
>
> bogoutil -d /tmp/orig/wordlist.db | tee orig.tmp | wc -l
> bogoutil -d /tmp/onlyspam/wordlist.db | tee spam.tmp | wc -l
> diff orig.tmp spam.tmp
The difference is pretty big: 81k words in the original list versus 58k
in the new one. But all I care about is the "enlarge" token, right?
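With a difference that large, a raw diff is hard to read. A rough sketch of listing only the entries that were dropped, keyed on the token name, might look like this. The two files stand in for orig.tmp and spam.tmp from the suggestion above, and the "sometoken" line is made up purely for illustration:

```shell
# Stand-ins for orig.tmp and spam.tmp. "sometoken" is a made-up entry,
# not something from a real wordlist.
tmpdir=$(mktemp -d)
printf '%s\n' 'enlarge 87 0 20031129' \
  'sometoken 1 5 20031110' > "$tmpdir/orig.tmp"
printf '%s\n' 'enlarge 87 0 20031129' > "$tmpdir/spam.tmp"

# First file: remember every token in the filtered dump.
# Second file: print original lines whose token was never seen.
dropped=$(awk 'NR == FNR { seen[$1] = 1; next } !($1 in seen)' \
  "$tmpdir/spam.tmp" "$tmpdir/orig.tmp")
echo "$dropped"
rm -rf "$tmpdir"
```

Scanning that list for anything that does not look like an ordinary word would show whether the filter removed more than plain tokens.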
I'm wondering if, by throwing out certain tokens with my awk script, I
somehow corrupted something that bogofilter is expecting to find in the
database. It's just a list of word tokens, right? I.e., if I throw out
"magiccookieyouneedthistorun 4 10 20031110" it's not going to affect
anything.
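To make concrete what the awk filter from the pipeline quoted above does to such a line, here is a small sketch run on two sample lines: the real "enlarge" entry and the made-up token from the example. It assumes the dump columns are token, spam count, good count, date, as the output above suggests:

```shell
# One real entry from the dump and the made-up example token.
# Assumed columns: token, spam count, good count, date.
sample='enlarge 87 0 20031129
magiccookieyouneedthistorun 4 10 20031110'

# Same filter as in the quoted pipeline: keep a line only when its
# spam count is strictly greater than its good count.
kept=$(printf '%s\n' "$sample" | awk '$2 > $3 {print}')
echo "$kept"
```

Any entry whose good count is equal to or higher than its spam count is dropped, whatever the entry is used for.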
Chris