[PATCH] combined wordlist a.k.a. single list

David Relson relson at osagesoftware.com
Thu Jun 5 01:15:57 CEST 2003


At 06:39 PM 6/4/03, Jeremy Blosser wrote:
>On Jun 04, Greg Louis [glouis at dynamicro.on.ca] wrote:
> > On 20030603 (Tue) at 1958:49 -0500, Jeremy Blosser wrote:
> > >
> > > Our goodlist is a pretty important resource for us. ...
> >
> > > Our spamlist is less important to us. ...
> >
> > > Also, our spam list is more generic and not really at all
> > > company-sensitive information ...
> > >
> > > These may just sound academic, but this is how we find ourselves
> > > looking at our lists ...
> >
> > Very clearly put, thank you!  Do you think it would alleviate your
> > concerns if bogoutil had a way to split out separate spam and nonspam
> > lists, either in the dump format or as new .db files, though the single
> > list were actually to be used in production?
>
>I honestly still prefer it how it is, but if they're going to merge this
>would probably be a reasonable compromise.

Hi Jeremy,

With the current two-list implementation, "bogoutil -d" writes lines of the 
form "token count timestamp", with the fields separated by spaces.  A sample 
of such a line is "hello 1234 20030603".  Lines written from spamlist.db and 
goodlist.db have the same format.

With the single-list implementation, each line has the form "token 
spam-count good-count timestamp".  The output can be split into separate 
spam and good (ham) lists with a simple awk script, as in:

bogoutil -d wordlist.db | awk '{print $1, $2, $4}' > spamlist.txt
bogoutil -d wordlist.db | awk '{print $1, $3, $4}' > goodlist.txt
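
As a small illustration of the split, the sketch below runs the same kind of 
awk filter over three made-up dump lines instead of a real wordlist.db.  The 
sample tokens and counts are invented, and skipping tokens whose count is 
zero is an assumption made for this sketch, not something bogoutil does:

```shell
#!/bin/sh
# Hypothetical sample of single-list dump output:
# token, spam count, good count, timestamp.
dump_sample() {
cat <<'EOF'
hello 12 34 20030603
viagra 56 0 20030601
meeting 0 78 20030602
EOF
}

# Spam side: token, spam count, timestamp; skip tokens never seen in spam.
dump_sample | awk '$2 > 0 {print $1, $2, $4}' > spamlist.txt
# Good side: token, good count, timestamp; skip tokens never seen in ham.
dump_sample | awk '$3 > 0 {print $1, $3, $4}' > goodlist.txt
```

Each output file is then back in the three-field "token count timestamp" 
form used by the two-list dumps.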

This same basic mechanism is in bogoupgrade for converting two lists into one:

( bogoutil -d $spam | awk '{printf "%s %d 0 %d\n", $1, $2, $3}' ; \
   bogoutil -d $good | awk '{printf "%s 0 %d %d\n", $1, $2, $3}' ) \
| sort | bogoutil -l wordlist.db
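
To show what the converted lines look like, here is the same conversion run 
on two tiny made-up dumps.  The conversion emits one line per list for a 
token that appears in both lists; the extra awk step below folds those into 
one line per token by summing the counts and keeping the newer timestamp.  
That folding step is an assumption for this sketch and is not part of 
bogoupgrade; the example does not show whether "bogoutil -l" needs it.

```shell
#!/bin/sh
# Made-up two-list dumps: token, count, timestamp.
spam_dump() { printf '%s\n' 'hello 12 20030601' 'viagra 56 20030601'; }
good_dump() { printf '%s\n' 'hello 34 20030603' 'meeting 78 20030602'; }

# Convert each dump to the four-field form, then fold duplicate tokens.
( spam_dump | awk '{printf "%s %d 0 %d\n", $1, $2, $3}' ; \
  good_dump | awk '{printf "%s 0 %d %d\n", $1, $2, $3}' ) \
| awk '{ spam[$1] += $2; good[$1] += $3
         if ($4 > ts[$1]) ts[$1] = $4 }      # keep the newest timestamp
       END { for (t in spam)
                 printf "%s %d %d %d\n", t, spam[t], good[t], ts[t] }' \
| sort > combined.txt
```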


After running bogoupgrade, wordlist.db can be made smaller (for optimal 
BerkeleyDB performance) by running the command:

bogoutil -d wordlist.db | bogoutil -l wordlist.db.new

I've seen this extra step reduce the db size by almost 50%.  It seems that 
the order in which a db is populated can have a major effect on its size.

It's very likely that bogofilter can be made smart enough to detect whether 
BOGOFILTER_DIR has one wordlist or two and to operate happily with whatever 
it finds.
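
A rough shell sketch of that kind of detection follows.  It is not 
bogofilter code: the function name and the words it prints are hypothetical, 
and preferring wordlist.db when both layouts are present is an assumption.

```shell
#!/bin/sh
# Report which wordlist layout a directory holds.
# wordlist_layout and its output words are hypothetical names.
wordlist_layout() {
    dir=$1
    if [ -f "$dir/wordlist.db" ]; then
        # Assumption: a combined list wins if both layouts exist.
        echo single
    elif [ -f "$dir/spamlist.db" ] && [ -f "$dir/goodlist.db" ]; then
        echo double
    else
        echo none
    fi
}

# Example use; "$HOME/.bogofilter" is only a fallback chosen for this sketch.
wordlist_layout "${BOGOFILTER_DIR:-$HOME/.bogofilter}"
```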

However, I'll be away from my computers from this Friday thru next, so any 
such changes will be a while.

Cheers,

David




