DB won't train, 'Unsure' about everything

Tue Feb 18 22:16:17 CET 2025

On 18.02.25 at 11:52, Adrian via bogofilter wrote:
> I later realised that I'd jumped to a conclusion and got it very wrong.
>
> .ENCODING probably never appeared in a mail.  bogofilter adds it to a
> new DB when you create it by training.
>
> I tried this, and it created the line
> .ENCODING 2 0 20250218
>
> So I replaced the wrong version ".ENCODING 4 0 20230304" with this new
> one.  (Needs to be done in a binary-safe way.)
>
> and it still works.  Maybe bogofilter would have added the missing line
> itself, didn't check.
>
> The three 'special' lines mostly look like ordinary tokens, but the
> spamcount column is sometimes used for a different purpose.
>
> Matthias, could you explain what the significance of the values is?

A quick search of the docs or sources would have turned up this:

RELEASE.NOTES:

> [Major 0.95.0] Unicode in UTF-8
>
> This release supports Unicode (UTF-8).  A new meta-token .ENCODING has
> been added to the wordlist so that bogofilter can determine if it's
> using Unicode or not.  A value of 1 indicates raw storage and 2
> indicates UTF-8 encoded tokens.  Bogofilter checks for this meta-token
> and converts incoming text to UTF-8 as appropriate.
[please read the rest of that section, maybe online at
https://gitlab.com/bogofilter/bogofilter/-/blob/bogofilter-1.2.5/bogofilter/RELEASE.NOTES?ref_type=tags#L68
]

So a value of 4 looks wrong. If it came from merging wordlists rather
than just exporting and importing them, we should investigate whether
the current beta or the Git version gets merging wrong, and how exactly
you merged, so that we can keep the .ENCODING meta-token values from
being summed up.
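
To illustrate the concern, here is a minimal Python sketch of merging two
text dumps without summing .ENCODING. It assumes dump lines of the form
"token spamcount hamcount date", as in the ".ENCODING 2 0 20250218" line
quoted above. The helper names (parse_dump, merge_dumps, check_encoding)
are made up for this example and are not part of bogofilter. Note that a
naive sum over two UTF-8 wordlists would give 2 + 2 = 4, which would match
the value seen here, but that is only a guess at the cause.

```python
from typing import Dict, Iterable, Tuple

# (spamcount, hamcount, date as YYYYMMDD); layout assumed from the quoted dump line
Entry = Tuple[int, int, int]

# Per RELEASE.NOTES: 1 = raw storage, 2 = UTF-8 encoded tokens
VALID_ENCODINGS = {1, 2}


def parse_dump(lines: Iterable[bytes]) -> Dict[bytes, Entry]:
    """Parse dump lines into a dict; works on bytes to stay binary-safe."""
    entries: Dict[bytes, Entry] = {}
    for line in lines:
        line = line.rstrip(b"\r\n")
        if not line:
            continue
        # Split from the right so only the last three fields are numeric.
        token, spam, ham, date = line.rsplit(None, 3)
        entries[token] = (int(spam), int(ham), int(date))
    return entries


def merge_dumps(a: Dict[bytes, Entry], b: Dict[bytes, Entry]) -> Dict[bytes, Entry]:
    """Sum counts for tokens, but keep .ENCODING as a flag, not a counter."""
    merged = dict(a)
    for token, (spam, ham, date) in b.items():
        if token not in merged:
            merged[token] = (spam, ham, date)
            continue
        old_spam, old_ham, old_date = merged[token]
        newest = max(old_date, date)
        if token == b".ENCODING":
            # Both lists must agree on the encoding; never add the values.
            if old_spam != spam:
                raise ValueError("wordlists use different encodings")
            merged[token] = (old_spam, old_ham, newest)
        else:
            # Assumption: every other token is summed, which may not be
            # appropriate for the remaining meta-tokens.
            merged[token] = (old_spam + spam, old_ham + ham, newest)
    return merged


def check_encoding(entries: Dict[bytes, Entry]) -> bool:
    """Return True if .ENCODING is absent or holds a documented value."""
    entry = entries.get(b".ENCODING")
    return entry is None or entry[0] in VALID_ENCODINGS
```

Running check_encoding() on a parsed dump is a cheap way to spot a value
such as 4 before importing the dump back into a wordlist.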