Problem compacting databases

David Relson relson at osagesoftware.com
Wed Jan 19 13:35:03 CET 2005


On Wed, 19 Jan 2005 12:01:36 +0100
Juan J. Martinez wrote:

> Hello!
> 
> After reading FAQ instructions
> (http://bogofilter.sourceforge.net/faq.shtml#compact-database) I try to
> compact my database and I get the following error:
> 
> # bogoutil -v -d wordlist.db | bogoutil -v -l wordlist.db.new
> bogoutil: Unexpected input [d'informÃ] on line 18719. Expecting
> whitespace before count.

Sounds like something strange in that line or the one before it.
"bogoutil -d" produces a sequence of lines with 4 items per line (the
token, its spam and ham counts, and the timestamp) with spaces between
them.  The error message indicates that one line has been interpreted as
having more than 4 fields in it, which should never happen!

I suggest running "bogoutil -d wordlist.db > wordlist.txt", then
gzipping wordlist.txt and posting that.  That will give me a chance to
see exactly what's there (without any additional character
translations).
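
To narrow the problem down locally before posting, a short script along
these lines could list the lines of the dump that do not have the
expected 4 fields.  This is only a sketch: the function names are made
up for illustration, and it assumes the fields are separated by ASCII
whitespace and that the two counts and the timestamp are plain decimal
digits.  The file is read as raw bytes so that no charset translation
gets in the way.

```python
def find_suspect_lines(lines):
    """Yield (line_number, raw_line) for lines that do not look like
    'token spam_count ham_count timestamp'.

    `lines` is any iterable of bytes objects, one per dump line.
    """
    for number, raw in enumerate(lines, start=1):
        # bytes.split() with no argument splits on ASCII whitespace only.
        fields = raw.split()
        # Assumption: exactly 4 fields, the last 3 being decimal digits.
        if len(fields) != 4 or not all(f.isdigit() for f in fields[1:]):
            yield number, raw


def check_dump(path):
    """Print every suspect line of a saved dump (hypothetical helper)."""
    with open(path, "rb") as dump:  # raw bytes, no decoding
        for number, raw in find_suspect_lines(dump):
            print(number, raw)
```

Calling check_dump("wordlist.txt") on the saved dump would print the
line number and the raw bytes of each line that fails the check, which
should make any stray character easy to spot.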
 
> The wordlist.db.new is pretty small, but seems bogoutil exited when the
> error was found and most of the db was lost.
>
> I don't know If the db is corrupted, but seems so.

Running "db_verify wordlist.db" will tell if the wordlist is corrupted.
From your description, the problem sounds more like an unexpected
character in the database than database corruption.

> That's bogofilter 0.92.8 (with BerkeleyDB 4.2.52).
> 
> I think that was a message without charset, or wrong charset (it should
> be iso-8859-15 or iso-8859-1).
> 
> There's any problem with bogofilter and some charsets?

Bogofilter's handling of charsets is fairly simple.  The lexer code
(generated by flex from source file lexer_v3.l) prefers 7-bit
characters.  To accommodate this, a number of characters (in iso-8859-1
and iso-8859-15) are translated (by simple table lookup).  The method
works quite satisfactorily.
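
As a rough illustration of what translation by table lookup means, here
is a minimal sketch.  It is not bogofilter's code, and the mappings are
hypothetical examples chosen for illustration, not bogofilter's actual
table: each input byte indexes a 256-entry table, and the entry found
there is the output byte.

```python
def build_table(overrides):
    """Return a 256-byte table that maps every byte to itself, except
    for the bytes listed in `overrides` (source byte -> replacement)."""
    table = bytearray(range(256))
    for source, replacement in overrides.items():
        table[source] = replacement
    return bytes(table)


# Hypothetical iso-8859-1 mappings, for illustration only.
EXAMPLE_OVERRIDES = {
    0xA0: 0x20,  # no-break space        -> plain space
    0xAB: 0x22,  # left guillemet  (<<)  -> double quote
    0xBB: 0x22,  # right guillemet (>>)  -> double quote
}
EXAMPLE_TABLE = build_table(EXAMPLE_OVERRIDES)


def translate(data):
    """Translate a bytes object one byte at a time via the table."""
    return data.translate(EXAMPLE_TABLE)
```

With this example table, translate(b"a\xa0b") gives b"a b", and bytes
that have no override pass through unchanged.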

Regards,

David


