DB won't train, 'Unsure' about everything
Adrian
adrian at aeolian.org.uk
Tue Feb 18 23:31:32 CET 2025
On Tue, 18 Feb 2025 22:16:17 +0100
Matthias Andree via bogofilter <bogofilter at bogofilter.org> wrote:
> > This release supports Unicode (UTF-8). A new meta-token .ENCODING has
> > been added to the wordlist so that bogofilter can determine if it's
> > using Unicode or not. A value of 1 indicates raw storage and 2
> > indicates UTF-8 encoded tokens. Bogofilter checks for this meta-token
> > and converts incoming text to UTF-8 as appropriate.
> [please read the rest of that section, maybe online at
> So a value of 4 seems wrong. If this was the result of merging wordlists
> and not just exporting/importing them we should investigate if the
> current beta or Git stuff were to get merging wrong, and how you do it,
> so we can avoid the .ENCODING meta token values to be summed up.
I'm puzzled. I have what I believe is the text dump from the SQLite3
wordlist.db of my previous install, and that has the .ENCODING 4, with
a date of 20230304. I don't think that original DB was created from a
merge. I didn't know about text dumping and loading then, unless I'd
known it and forgotten it. The 'file' command identifies the file as
Non-ISO extended-ASCII text.
I created a fairly small wordlist.db a few days ago by training a few
mails, and the text dump has .ENCODING 2. The 'file' command
gives it as Unicode text, UTF-8 text. I can't see a single character
in that small file that isn't in the 7-bit ASCII set so I don't know
what makes it UTF-8 to the file command.
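One way to check by machine rather than by eye is to scan the dump for any byte above 0x7F. This is only a sketch: the function name find_non_ascii is made up for illustration, and it assumes the dump is a plain file with one token per line.

```python
def find_non_ascii(data: bytes, limit: int = 10):
    """Return up to `limit` (line_number, column, byte_value) tuples
    for bytes outside the 7-bit ASCII range."""
    hits = []
    for lineno, line in enumerate(data.split(b"\n"), start=1):
        for col, byte in enumerate(line):
            if byte > 0x7F:  # anything above 0x7F is not 7-bit ASCII
                hits.append((lineno, col, byte))
                if len(hits) >= limit:
                    return hits
    return hits
```

Calling it on the contents of the dump, e.g. find_non_ascii(open("wordlist.txt", "rb").read()), lists the first few offending positions, or an empty list if the file really is pure ASCII.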
The wordlist.txt that I used to create my current working DB is the
original text dump, loaded with only one change: the .ENCODING line now
reads .ENCODING 2 0 20250218.
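To confirm which value a given text dump actually carries, the .ENCODING line can be pulled out directly. A rough sketch, assuming the dump has whitespace-separated fields with the token first and the value second, as in the line quoted above; read_encoding is a hypothetical helper, not part of bogofilter.

```python
from typing import Optional


def read_encoding(data: bytes) -> Optional[int]:
    """Return the value on the .ENCODING line of a text dump, or None if absent."""
    for line in data.split(b"\n"):
        fields = line.split()
        # Assumed layout: token, first count, second count, date
        if len(fields) >= 2 and fields[0] == b".ENCODING":
            return int(fields[1])
    return None
```

Reading the file in binary mode avoids any decoding errors from raw tokens elsewhere in the dump.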
I'm still investigating the encodings used; there seems to be both
Unicode and raw binary. I guess the 'file' command on raw text just reads
the beginning to decide.
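Since a check that only looks at the start of a file could miss a mix of encodings further in, a per-line tally over the whole dump may be more telling. A minimal sketch, with hypothetical names classify_line and summarize, that counts lines as pure ASCII, valid UTF-8, or neither:

```python
import sys
from collections import Counter


def classify_line(line: bytes) -> str:
    """Label one line as 'ascii', 'utf-8' (valid, with multi-byte sequences) or 'raw'."""
    try:
        line.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        line.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "raw"  # high bytes that do not form valid UTF-8


def summarize(data: bytes) -> Counter:
    """Count lines of each kind over the whole file, skipping empty lines."""
    return Counter(classify_line(l) for l in data.split(b"\n") if l)


if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], "rb") as fh:
        print(summarize(fh.read()))
```

A dump that reports both 'utf-8' and 'raw' lines would suggest the tokens were stored under more than one encoding.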