[PATCH] combined wordlist a.k.a. single list
David Relson
relson at osagesoftware.com
Thu Jun 5 01:15:57 CEST 2003
At 06:39 PM 6/4/03, Jeremy Blosser wrote:
>On Jun 04, Greg Louis [glouis at dynamicro.on.ca] wrote:
> > On 20030603 (Tue) at 1958:49 -0500, Jeremy Blosser wrote:
> > >
> > > Our goodlist is a pretty important resource for us. ...
> >
> > > Our spamlist is less important to us. ...
> >
> > > Also, our spam list is more generic and not really at all
> > > company-sensitive information ...
> > >
> > > These may just sound academic, but this is how we find ourselves
> > > looking at our lists ...
> >
> > Very clearly put, thank you! Do you think it would alleviate your
> > concerns if bogoutil had a way to split out separate spam and nonspam
> > lists, either in the dump format or as new .db files, though the single
> > list were actually to be used in production?
>
>I honestly still prefer it how it is, but if they're going to merge this
>would probably be a reasonable compromise.
Hi Jeremy,
With the current two-list implementation, "bogoutil -d" writes lines of the
form "token count timestamp", with the fields separated by whitespace. A
sample of such a line is "hello 1234 20030603". Lines written from
spamlist.db and goodlist.db have the same format.
With the single-list implementation, lines are formatted as "token
spam-count good-count timestamp". The output can be split into a spam list
and a good (ham) list with a simple awk script, as in:
bogoutil -d wordlist.db | awk '{print $1, $2, $4}' > spamlist.txt
bogoutil -d wordlist.db | awk '{print $1, $3, $4}' > goodlist.txt
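For anyone who wants to try the split without a real wordlist.db, here is a minimal sketch that does both halves in one awk pass over a few made-up dump lines. The sample tokens and counts are invented for illustration, and skipping tokens whose count is zero is my own assumption about what a split list should contain, not something bogoutil does.

```shell
# Made-up single-list dump lines: token, spam count, good count, timestamp.
# With a real database the input would come from: bogoutil -d wordlist.db
dump='hello 12 340 20030603
viagra 987 0 20030601
meeting 0 55 20030602'

# One pass over the dump, writing each token to the list(s) in which it has
# a nonzero count. Dropping zero counts is an assumption of this sketch.
printf '%s\n' "$dump" | awk '
    $2 > 0 { print $1, $2, $4 > "spamlist.txt" }
    $3 > 0 { print $1, $3, $4 > "goodlist.txt" }
'

cat spamlist.txt
cat goodlist.txt
```

Dropping the two "> 0" conditions would give the same lists as the two-command version above, zero counts included.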
This same basic mechanism is in bogoupgrade for converting two lists into one:
( bogoutil -d $spam | awk '{printf "%s %d 0 %d\n", $1, $2, $3}' ; \
bogoutil -d $good | awk '{printf "%s 0 %d %d\n", $1, $2, $3}' ) \
| sort | bogoutil -l wordlist.db
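To show what the converted lines look like, here is a rough sketch of the same conversion run on made-up two-list dumps, with no bogoutil involved. The extra awk stage that folds a token appearing in both lists into one line (summing the counts and keeping the newer timestamp) is an assumption added for illustration; how "bogoutil -l" itself treats repeated tokens is not covered here.

```shell
# Made-up dumps in the two-list format: token, count, timestamp.
spam_dump='hello 12 20030601
viagra 987 20030601'
good_dump='hello 340 20030603
meeting 55 20030602'

# Rewrite each dump into the four-column single-list format, then fold
# lines that share a token: counts are summed, the newest timestamp is kept.
# (The folding stage is an assumption of this sketch, not part of bogoupgrade.)
merged=$( {
    printf '%s\n' "$spam_dump" | awk '{printf "%s %d 0 %d\n", $1, $2, $3}'
    printf '%s\n' "$good_dump" | awk '{printf "%s 0 %d %d\n", $1, $2, $3}'
} | awk '
    {
        spam[$1] += $2
        good[$1] += $3
        if ($4 + 0 > ts[$1] + 0) ts[$1] = $4 + 0
    }
    END { for (t in ts) printf "%s %d %d %d\n", t, spam[t], good[t], ts[t] }
' | sort )

printf '%s\n' "$merged"
```

With the sample data, "hello" ends up on a single line carrying both counts, while "viagra" and "meeting" each keep a zero in the column for the list they were absent from.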
After running bogoupgrade, wordlist.db can be made smaller (for optimal
BerkeleyDB performance) by running the command:
bogoutil -d wordlist.db | bogoutil -l wordlist.db.new
I've seen this extra step reduce the db size by almost 50%. It seems that
the order in which a db is populated can have a major effect on its size.
It's very likely that bogofilter can be made smart enough to detect whether
BOGOFILTER_DIR has one wordlist or two and to operate happily with whatever
it finds.
However, I'll be away from my computers from this Friday through next, so
any such changes will take a while.
Cheers,
David