Clean the database from non-spam mails?
David Relson
relson at osagesoftware.com
Wed Dec 3 00:58:44 CET 2003
On Tue, 2 Dec 2003 10:56:53 -0800
Chris Wilkes <cwilkes-bf at ladro.com> wrote:
> On Tue, Dec 02, 2003 at 07:14:10PM +0100, Johannes Klug wrote:
> >
> > I'd like to remove all non-spam emails from my database. I
> > trained bogofilter only with about 200 ham emails, now my ham-box
> > is about 700.
>
> I did a little experiment with removing all words whose ham counts
> were higher than their spam counts, by filtering the output of
> bogoutil -d, which is:
> 1 word
> 2 spam count
> 3 ham count
> 4 date updated
>
> # bogoutil -d ./wordlist.db | awk '$2 > $3 {print}' | bogoutil -l
> ./new.db
>
> However this causes all my emails that registered with a spamicity of
> 1.0000 to fall to 0.41. Just looking for one obvious spam word
> 'enlarge' in an email:
>
> # bogoutil -d /tmp/orig/wordlist.db | \
> awk '$2 > $3 {print}' | bogoutil -l /tmp/onlyspam/wordlist.db
>
> # bogoutil -w /tmp/orig/wordlist.db enlarge
> spam good
> enlarge 87 0
>
> # bogoutil -w /tmp/onlyspam/wordlist.db enlarge
> spam good
> enlarge 87 0
>
> An email with 'enlarge' in it ($s = email message file):
>
> # bogofilter -vvv -d /tmp/orig/ -I $s | grep enlarge
> "enlarge" 87 0.000000 0.021712 0.999933 +
>
> # bogofilter -vvv -d /tmp/onlyspam/ -I $s | grep enlarge
> "enlarge" 0 0.000000 0.000000 0.415000 -
>
> Doesn't seem to pick it up now. Did I screw up something with
> creating the new wordlist.db file? The spam score of that email went
> from 1.0000 to 0.415000. I'm running version 0.15.9.
>
> Chris
Hi Chris,
Not being really good with awk, I tried your command and found it to
work fine for me. To see more about what it does, I ran
bogoutil -d /tmp/orig/wordlist.db | tee 1.tmp | \
awk '$2 > $3 {print}' | tee 2.tmp | \
bogoutil -l /tmp/onlyspam/wordlist.db
and then ran "gtkdiff 1.tmp 2.tmp".
From the zero counts and the 0.415000 value it appears that "enlarge" is
not in /tmp/onlyspam/wordlist.db. Use "bogoutil -d
/tmp/onlyspam/wordlist.db | grep enlarge" to see if it's there or not.
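Since grep also matches longer tokens that merely contain "enlarge", an
exact-match lookup may be easier to read. A minimal sketch, assuming the
dump columns are whitespace-separated in the order Chris listed (word, spam
count, ham count, date); lookup_token is a made-up helper name, not part of
bogofilter:

```shell
# Hypothetical helper, not a bogofilter command.
# Reads "bogoutil -d" style lines on stdin and prints the line whose
# first field is exactly the given token.
# Exit status is 0 if the token was found, 1 if it was not.
lookup_token() {
    awk -v tok="$1" '
        $1 == tok { print; found = 1 }
        END       { exit !found }
    '
}
```

It could be used like this:
bogoutil -d /tmp/onlyspam/wordlist.db | lookup_token enlarge || echo "enlarge is missing"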
To check the whole operation, I'd do something like:
bogoutil -d /tmp/orig/wordlist.db | tee orig.tmp | wc -l
bogoutil -d /tmp/onlyspam/wordlist.db | tee spam.tmp | wc -l
diff orig.tmp spam.tmp
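If the diff output is too long to read, the comparison can be narrowed to
just the tokens that disappeared. A rough sketch, again assuming the
four-column dump layout Chris described; dropped_tokens is a hypothetical
helper, not a bogoutil feature:

```shell
# Hypothetical helper, not a bogofilter command.
# $1 = dump of the original wordlist, $2 = dump of the filtered wordlist.
# Prints every line of $1 whose token (field 1) is absent from $2.
dropped_tokens() {
    awk '
        # First file on the command line is the filtered dump:
        # remember each token that survived.
        FILENAME == ARGV[1] { kept[$1] = 1; next }
        # Second file is the original dump: print tokens that did not survive.
        !($1 in kept)
    ' "$2" "$1"
}
```

For example:
dropped_tokens orig.tmp spam.tmp | head
If the awk filter did what was intended, every line printed should have a
spam count no greater than its ham count, and "enlarge" should not be among
them.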
David