Clean the database from non-spam mails?

David Relson relson at osagesoftware.com
Wed Dec 3 05:14:03 CET 2003


On Tue, 2 Dec 2003 18:57:47 -0800
Chris Wilkes <cwilkes-bf at ladro.com> wrote:

> To show the problem:
> 
>  mkdir /tmp/orig /tmp/onlyspam
>  cd /tmp/orig
>  wget http://ladro.com/bf/bf-enlarge2.tar.gz
>  tar -xvzf bf-enlarge2.tar.gz
>  bogoutil -l ./wordlist.db < wordlist.txt
>  awk '$2 > $3 {print}' wordlist.txt | bogoutil -l /tmp/onlyspam/wordlist.db
>  bogofilter -v -x d -d /tmp/orig     < Enlargeemail.txt
>  bogofilter -v -x d -d /tmp/onlyspam < Enlargeemail.txt
> 
> Chris
> 

Chris,

I have the answer to your problem!  Try finding .MSG_COUNT in the two
wordlists.  It's in the old but not the new.  Not knowing how many
messages were used to build the wordlist causes bogofilter to behave
in an unexpected manner (which I can explain if you want).

What happened is that your awk script suppressed that token because its
ham count (4585) exceeded its spam count (4007).
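
To illustrate, here is a small sketch that runs the quoted awk test
against a few sample lines.  Only the .MSG_COUNT numbers come from this
thread; the other two tokens and both function names are made up, and
the "token spam_count ham_count" field order is assumed from the way the
awk test is used above.  The second function is a hypothetical variant
that lets .MSG_COUNT through, not something from the original script.

```shell
# The filter as quoted above: keep lines whose 2nd field exceeds the 3rd.
orig_filter() {
    awk '$2 > $3 {print}'
}

# Hypothetical variant: always keep the .MSG_COUNT line as well.
keep_msg_count() {
    awk '$1 == ".MSG_COUNT" || $2 > $3 {print}'
}

# Sample input: .MSG_COUNT counts from this thread, other tokens invented.
sample_lines() {
    printf '%s\n' '.MSG_COUNT 4007 4585' 'spamword 12 3' 'hamword 2 40'
}

echo "original filter:"
sample_lines | orig_filter
echo "variant keeping .MSG_COUNT:"
sample_lines | keep_msg_count
```

With the original filter the .MSG_COUNT line is dropped along with
hamword, because 4007 is not greater than 4585.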

I don't recall your original goal.  Was it to remove the ham portion of
the wordlist?  If so, then an appropriate use of "awk | grep" can do it
-- with awk changing the ham count to 0 and grep removing tokens with 0
counts for ham and spam.
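
A minimal sketch of that "awk | grep" idea, assuming each line of the
text wordlist reads "token spam_count ham_count [date]" (the field order
implied by the awk test above) and that tokens contain no spaces.  The
function name strip_ham is made up for illustration.

```shell
# Hypothetical helper: zero every ham count, then drop tokens that are
# left with 0 for both spam and ham.
# Assumed input format: token spam_count ham_count [date]
strip_ham() {
    # awk rewrites field 3 (ham count) to 0; assigning a field makes awk
    # rebuild the line with single spaces between fields.
    # grep -v then removes lines whose spam and ham counts are both 0.
    awk '{ $3 = 0; print }' | grep -v -E '^[^ ]+ 0 0( |$)'
}
```

It would sit between the text wordlist and the load step, for example
"strip_ham < wordlist.txt | bogoutil -l /tmp/onlyspam/wordlist.db".
Because .MSG_COUNT has a non-zero spam count, its line is kept (with the
ham count zeroed) rather than dropped.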

Hope this helps.

David




