DB won't train, 'Unsure' about everything

Tue Feb 18 23:31:32 CET 2025

On Tue, 18 Feb 2025 22:16:17 +0100
Matthias Andree via bogofilter <bogofilter at bogofilter.org> wrote:
> > This release supports Unicode (UTF-8).  A new meta-token .ENCODING has
> > been added to the wordlist so that bogofilter can determine if it's
> > using Unicode or not.  A value of 1 indicates raw storage and 2
> > indicates UTF-8 encoded tokens.  Bogofilter checks for this meta-token
> > and converts incoming text to UTF-8 as appropriate.  
> [please read the rest of that section, maybe online at

> So a value of 4 seems wrong. If this was the result of merging wordlists
> and not just exporting/importing them we should investigate if the
> current beta or Git stuff were to get merging wrong, and how you do it,
> so we can avoid the .ENCODING meta token values to be summed up.

I'm puzzled.  I have what I believe is the text dump from the SQLite3
wordlist.db of my previous install, and that has .ENCODING 4, with
a date of 20230304.  I don't think that original DB was created from a
merge.  I didn't know about text dumping and loading then, unless I'd
known it and forgotten it. The 'file' command identifies the file as
Non-ISO extended-ASCII text.

I created a fairly small wordlist.db a few days ago by training a few
mails, and the text dump has .ENCODING 2.  The 'file' command
gives it as Unicode text, UTF-8 text.  I can't see a single character
in that small file that isn't in the 7-bit ASCII set so I don't know
what makes it UTF-8 to the file command.
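
One way to check whether a dump really is pure 7-bit ASCII is to scan it
as bytes and report anything at or above 0x80.  This is only a minimal
sketch: the function name find_non_ascii is made up for illustration,
and it assumes the dump is a newline-separated text file read in binary
mode.

```python
def find_non_ascii(data: bytes, limit: int = 10):
    """Return up to `limit` tuples of (line number, byte offset in line, byte value)
    for bytes outside the 7-bit ASCII range."""
    hits = []
    for lineno, line in enumerate(data.split(b"\n"), start=1):
        for col, byte in enumerate(line):
            if byte >= 0x80:  # anything above 0x7f is not 7-bit ASCII
                hits.append((lineno, col, byte))
                if len(hits) >= limit:
                    return hits
    return hits
```

Calling it on the contents of the dump, e.g.
find_non_ascii(open("wordlist.txt", "rb").read()), gives the places to
look at with a hex viewer; an empty list means every byte is 7-bit ASCII.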

The wordlist.txt that I used to create my current working DB is the
original text dump, with the only change being that the .ENCODING line
now reads .ENCODING 2 0 20250218.
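
To compare dumps quickly, the .ENCODING value can be pulled out of the
text with a few lines of Python.  This is a sketch under assumptions:
it takes the value to be the first field after the token, as in the
line quoted above, and the meanings for 1 and 2 come from the quoted
release notes.  The names encoding_value and describe are hypothetical
helpers, not part of bogofilter.

```python
# Meanings as given in the quoted release notes; other values are unexpected.
ENCODING_MEANING = {1: "raw storage", 2: "UTF-8 encoded tokens"}

def encoding_value(dump: bytes):
    """Return the integer following the .ENCODING token, or None if absent."""
    for line in dump.splitlines():
        fields = line.split()
        # Assumption: the value is the first field after the token.
        if len(fields) >= 2 and fields[0] == b".ENCODING":
            return int(fields[1].decode("ascii"))
    return None

def describe(value):
    """Turn an .ENCODING value into a short human-readable description."""
    if value is None:
        return "no .ENCODING line found"
    return ENCODING_MEANING.get(value, f"unexpected value {value}")
```

The dump is read as bytes on purpose, so that raw (non-UTF-8) tokens
elsewhere in the file cannot cause a decoding error.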

I'm still investigating the encodings used; there seems to be a mix of
Unicode and raw binary.  I guess the 'file' command on raw text just reads
the beginning to decide.
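
Whatever portion 'file' looks at, a per-line check covers the whole
dump.  The sketch below sorts each line into pure ASCII, valid UTF-8, or
anything else (reported as "other", which would include raw 8-bit
tokens).  The function names are made up for illustration, and a line
that happens to be valid UTF-8 is not proof that it was meant as UTF-8.

```python
from collections import Counter

def classify_line(line: bytes) -> str:
    """Classify one dump line as 'ascii', 'utf-8', or 'other'."""
    if line.isascii():
        return "ascii"
    try:
        line.decode("utf-8")
    except UnicodeDecodeError:
        return "other"  # not valid UTF-8, e.g. raw 8-bit bytes
    return "utf-8"

def summarize(dump: bytes) -> Counter:
    """Count how many lines of the dump fall into each class."""
    return Counter(classify_line(line) for line in dump.splitlines())
```

A dump that shows both "utf-8" and "other" lines would match the mix
described above.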