Clean the database from non-spam mails?

David Relson relson at osagesoftware.com
Wed Dec 3 01:26:43 CET 2003


On Tue, 2 Dec 2003 16:12:36 -0800
Chris Wilkes <cwilkes-bf at ladro.com> wrote:

> On Tue, Dec 02, 2003 at 06:58:44PM -0500, David Relson wrote:

...[snip]...

> I quadruple-checked that -- I did a "-w ... enlarge" to see if it was
> in there (see above) and it is, and I also did a grep for it and it
> shows up.
> 
>   # bogoutil -d /tmp/onlyspam/wordlist.db | grep enlarge
>     enlarge 87 0 20031129
>     enlargeable 2 0 20031116
>     enlargement 95 0 20031129
>     subj:enlarge 3 0 20031009
>     subj:enlargement 13 0 20031127
> 
>   # bogoutil -d /tmp/orig/wordlist.db | grep enlarge
>     enlarge 87 0 20031129
>     enlargeable 2 0 20031116
>     enlargement 95 0 20031129
>     subj:enlarge 3 0 20031009
>     subj:enlargement 13 0 20031127
> 
> The "enlarge" in my email message is in the body, not in the subject
> line.
> 
> > To check the whole operation, I'd do something like:
> > 
> >     bogoutil -d /tmp/orig/wordlist.db | tee orig.tmp | wc -l
> >     bogoutil -d /tmp/onlyspam/wordlist.db | tee spam.tmp | wc -l
> >     diff orig.tmp spam.tmp
> 
> The difference is pretty big: 81k words in the original list versus
> 58k in the new one.  But all I care about is the "enlarge" token,
> right? I'm wondering if, by throwing out certain tokens with my awk
> script, I somehow corrupted something that bogofilter is expecting to
> find in the database.  It's just a list of word tokens, right?  I.e.,
> if I throw out "magiccookieyouneedthistorun 4 10 20031110", it's not
> going to affect anything.
> 
> Chris

Chris,

Your output looks great -- except for the "enlarge ... 0.415000" from
bogofilter.  The differing word counts sound good, as well.
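
As a rough illustration of the dump comparison discussed above, the sketch
below lists tokens that are in the original dump but missing from the
spam-only dump, then shows the line for one token in both dumps. The sample
lines are copied from this thread; the temporary directory and the name
dropped.txt are assumptions made for the example. In practice orig.tmp and
spam.tmp would be the two "bogoutil -d" dumps.

```shell
#!/bin/sh
# Sketch only: sample data stands in for the real bogoutil dumps.
workdir=$(mktemp -d)

cat > "$workdir/orig.tmp" <<'EOF'
enlarge 87 0 20031129
enlargement 95 0 20031129
magiccookieyouneedthistorun 4 10 20031110
EOF

cat > "$workdir/spam.tmp" <<'EOF'
enlarge 87 0 20031129
enlargement 95 0 20031129
EOF

# Remember every token in the spam-only dump (first file), then print
# tokens from the original dump (second file) that were not seen.
awk 'NR == FNR { seen[$1] = 1; next } !($1 in seen) { print $1 }' \
    "$workdir/spam.tmp" "$workdir/orig.tmp" > "$workdir/dropped.txt"
cat "$workdir/dropped.txt"

# Show the exact "enlarge" token line from both dumps for comparison.
grep '^enlarge ' "$workdir/orig.tmp" "$workdir/spam.tmp"
```

If the token of interest has identical counts in both dumps and only
unrelated tokens show up in dropped.txt, the filtering step itself is
probably not what changed the score.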

The next thing to try is telling bogofilter to display what's happening
in the datastore code.  Add the flags "-v -x d" when you run it.

If that doesn't show us anything useful, the next step will be to create
a .tgz with wordlist.db and the message and ftp it to me.  Hopefully,
that won't be necessary.
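
For reference, packing the database and the message into a .tgz could look
roughly like the sketch below. The file name message.txt, the archive name
bogofilter-report.tgz, and the temporary directory are assumptions for
illustration; empty placeholder files are created only so the commands can
be run as written.

```shell
#!/bin/sh
# Sketch only: placeholders stand in for the real database and message.
bundle=$(mktemp -d)
: > "$bundle/wordlist.db"     # placeholder for the real wordlist.db
: > "$bundle/message.txt"     # hypothetical name for the saved message

# Pack both files into one gzip-compressed tar archive.
tar -czf "$bundle/bogofilter-report.tgz" -C "$bundle" wordlist.db message.txt

# List the archive contents to confirm both files made it in.
tar -tzf "$bundle/bogofilter-report.tgz"
```

Using -C keeps the archive free of the temporary directory path, so it
unpacks to just the two files.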

David



