bogoutil -s -m changes wordlist encoding?

Sat Dec 31 19:02:41 CET 2005

On Sat, 31 Dec 2005 16:16:33 +0000 (UTC)
Jason Lunz wrote:

> I'm looking at the effect of bogoutil -m on my wordlist by running it on
> a copy, then comparing the output of "bogoutil -d" from the before and
> after wordlists. When I tried "bogoutil -s 1,40 -m new/wordlist.db", the
> diff showed not only that the short/long tokens had been removed, but
> also that non-ascii tokens were changed.
> 
> For example, here's the first few hunks of the diff of the "bogoutil -d"
> output for the before and after wordlists:

...[snip]...

> Is this expected? It looks to me like the tokens have been corrupted
> in the after wordlist.
> 
> Jason

Hello Jason,

Good question!  The short answer is "Yes, this is expected."

Back in June (with release 0.95.0), unicode became the default encoding
for bogofilter's tokens.  When run, bogofilter checks for a special
token (named ".ENCODING") and that tells it whether the wordlist uses
unicode or not.  This information is used so that new tokens are
encoded in the same manner as already existing tokens.

When you specified "maintenance mode" with bogoutil's "-m" option,
bogoutil updated the encoding of your database.  That's why you now
have the .ENCODING token.  Also, since unicode representations of
special characters are different from their iso-8859-1 representations
(the older default encoding), the output of "bogoutil -d" is different.
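
To see why the dumped tokens look different, here is a small sketch
that is independent of bogofilter.  It assumes the usual "iconv", "od"
and "tr" tools are installed, and it assumes the unicode form is UTF-8.
It takes one accented character (e-acute, picked only as an
illustration) and prints its bytes in each encoding:

```shell
# e-acute is the single byte 0xE9 (octal 351) in iso-8859-1.
latin1_hex=$(printf '\351' | od -An -tx1 | tr -d ' \n')

# Convert the same character to UTF-8 (assumed to be the unicode form
# in use) and dump its bytes the same way.
utf8_hex=$(printf '\351' | iconv -f ISO-8859-1 -t UTF-8 | od -An -tx1 | tr -d ' \n')

echo "iso-8859-1: $latin1_hex"
echo "utf-8:      $utf8_hex"
```

One byte becomes two, so any token containing such a character will
show up as changed in a diff of the two dumps even though it stands
for the same word.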

If you care to, you can compare bogofilter scores with the new and old
wordlists, e.g.

  # Score the same message against each wordlist and compare the results.
  NEW=$(bogofilter -v -d new.db < test.message)
  OLD=$(bogofilter -v -d old.db < test.message)
  if [ "$NEW" != "$OLD" ] ; then
     echo "different:  NEW $NEW, OLD $OLD"
  else
     echo same
  fi

For every message you test, the output should be "same".

HTH,

David

P.S.  It's good to have people looking closely at what bogofilter does
and to have them asking questions.  Keep it up!