importing words from popfile into bogofilter

David Relson relson at osagesoftware.com
Sat Sep 3 14:20:07 CEST 2011


On Fri, 2 Sep 2011 09:54:56 -0700 (PDT)
Joseph Harth wrote:

> Hi David 
> I exported my word database to two files, spam.txt and ham.txt. These
> lists have all the words in my database repeated as many times as they
> were repeated in the database. How can I get these words into
> bogofilter? They are just plain text files with words. I also filtered
> the files for dictionary words with aspell.
> 

"bogoutil -l wordlist.db < token_file.txt" will load the entries in
token_file.txt into wordlist.db (bogoutil -l reads its input from stdin).

The format of token_file.txt is:

token1 spam_count ham_count date
token2 spam_count ham_count date
.MSG_COUNT spam_messages ham_messages date

Where spam_count is the number of times the token has been seen in
spam messages and ham_count is the count for ham messages.

On the .MSG_COUNT line, spam_messages and ham_messages are the number
of spam and ham messages processed in building the wordlist.  If
popfile doesn't have that information, you'll have to estimate it.
Note: a reasonable estimate for spam_messages might be double (2x) the
largest spam_count, and likewise 2x the largest ham_count for
ham_messages.
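
As a rough sketch of that estimate, something like the following
should do.  The function name estimate_msgs is made up for this
example, and it assumes the exported file holds one word per line,
repeated once per occurrence:

```shell
#!/bin/sh
# estimate_msgs FILE (hypothetical helper): print twice the highest
# per-word occurrence count found in FILE.
# Assumes FILE has one word per line, repeated once per occurrence.
estimate_msgs() {
    sort "$1" | uniq -c |
        awk '$1 + 0 > max { max = $1 + 0 } END { print 2 * max }'
}
```

Running "estimate_msgs spam.txt" and "estimate_msgs ham.txt" would then
give candidate values for spam_messages and ham_messages.  These are
only guesses, not real message counts.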

The date field is the date the tokens are entered into the wordlist.
The format is YYYYMMDD.
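
To make the layout concrete, here is a tiny token file written from
the shell.  The tokens and counts are invented purely for
illustration, and "date +%Y%m%d" is one way to produce the YYYYMMDD
stamp:

```shell
#!/bin/sh
# Illustrative only: the tokens and counts below are made up.
today=$(date +%Y%m%d)    # YYYYMMDD, e.g. 20110903

# Columns: token, spam_count, ham_count, date
cat > token_file.txt <<EOF
viagra 12 0 $today
meeting 0 7 $today
.MSG_COUNT 24 14 $today
EOF
```

The resulting file can then be loaded with
"bogoutil -l wordlist.db < token_file.txt".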

You'll have to do some experimenting to determine the best way to
convert your spam and ham files to the format needed by bogoutil.  The
following lines are (approx) what you'll need:

  sort < spam.txt | uniq -c | awk '{print $2, $1, 0}' | bogoutil -l wordlist.db
  sort < ham.txt | uniq -c | awk '{print $2, 0, $1}' | bogoutil -l wordlist.db
  bogoutil -d wordlist.db

I've written the above commands "off the cuff", so you may have to
tweak them a bit before they actually work.  I've also left out
the .MSG_COUNT line.
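
If you'd rather build one complete token file, .MSG_COUNT included,
a sketch along these lines might work.  The function name make_tokens
is made up for this example; it assumes one word per line in each
input file, and the message totals are only the 2x guess, not real
counts:

```shell
#!/bin/sh
# make_tokens SPAM_FILE HAM_FILE (hypothetical helper): write
# "token spam_count ham_count date" lines to stdout, followed by an
# estimated .MSG_COUNT line.
# Assumes one word per line, repeated once per occurrence.
make_tokens() {
    awk -v d="$(date +%Y%m%d)" '
        NF == 0 { next }                    # skip blank lines
        FILENAME == ARGV[1] { spam[$1]++ }  # words from the spam file
        FILENAME == ARGV[2] { ham[$1]++ }   # words from the ham file
        { seen[$1] = 1 }
        END {
            for (w in seen) {
                s = spam[w] + 0; h = ham[w] + 0
                if (s > smax) smax = s
                if (h > hmax) hmax = h
                print w, s, h, d
            }
            # Guess: message totals are 2x the largest count per side.
            print ".MSG_COUNT", 2 * smax, 2 * hmax, d
        }' "$1" "$2"
}
```

You could then try "make_tokens spam.txt ham.txt | bogoutil -l wordlist.db"
and check the result with "bogoutil -d wordlist.db".  The order of the
token lines is whatever awk happens to produce.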

Hope this helps,

David


