DB won't train, 'Unsure' about everything
Adrian
adrian at aeolian.org.uk
Mon Feb 17 22:48:35 CET 2025
On Mon, 17 Feb 2025 21:49:11 +0100
Matthias Andree via bogofilter <bogofilter at bogofilter.org> wrote:
> What is this "any file" that you give it? Does bogofilter understand
> what file format it is? Did you give it an empty file? What version are
> you looking at?
Yes, I gave an empty file, a raw ham file, a raw spam file, another raw
spam in Russian, and a short file containing a very rude comment.
Versions are bogofilter 1.2.5, Berkeley DB 5.3.28 running on Ubuntu
24.02.2.
>
> Possibly with lots of "-v" and maybe a few -x options? Maybe -vvvxbcdgu
> will elucidate us all. For -B I will definitely want the "reader" bit, -xb.
Will try and report back. -vvv gives a cryptic one-liner; I haven't
tried the other options yet.
>
> > The source text dump looks OK, though it has a lot of non-ASCII such as
> > AU<C2><F2> 0 1 20230305 (as displayed by less)
>
> What is the encoding? There should be an .ENCODING token in the text dump.
There's this
.ENCODING 4 0 20230304
but that says the string appeared in four spams, not what encoding
they used.
The five test files used Content-Transfer-Encoding: quoted-printable
and base64. The short file was 7-bit ASCII, the empty file was - well,
empty.
>
> Also, what are the spam and ham message counts? bogoutil -d
> ~/.bogofilter/wordlist | grep MSG_COUNT should tell you.
.MSG_COUNT 6715 13369 20250213
>
> > And why should a text dump that loads without error result in a DB that
> > doesn't work??!
>
> You didn't show the bogoutil -l output, so I don't know. :-)
You mean the DB that was created? Happy to send it if you are able
and willing to investigate it.
UPDATE
Aha! I did the binary chop!
head -n 311 wordlist.txt >shortwordlist.txt
rm wordlist.db
bogoutil -l wordlist.db <shortwordlist.txt
gives a working DB
head -n 312 wordlist.txt >shortwordlist.txt
rm wordlist.db
bogoutil -l wordlist.db <shortwordlist.txt
gives a broken DB
And line 312 is...
.ENCODING 4 0 20230304
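
For anyone who wants to repeat the binary chop without doing it by hand, here is a minimal sketch. The function name first_bad_line and the idea of passing a "check" command are my own assumptions, not anything bogoutil provides. It assumes the empty prefix is good, the whole file is bad, and that once a prefix is bad every longer prefix is bad too.

```shell
# Hypothetical helper: print the smallest N such that the first N lines
# of DUMP make CHECK fail. CHECK is any command or shell function that
# takes a file path and exits 0 when that file is "good".
first_bad_line() {
    dump=$1; shift
    lo=0                                   # a prefix of $lo lines is good
    hi=$(wc -l < "$dump" | tr -d ' ')      # a prefix of $hi lines is bad
    tmp=$(mktemp)
    while [ $((hi - lo)) -gt 1 ]; do
        mid=$(( (lo + hi) / 2 ))
        head -n "$mid" "$dump" > "$tmp"
        if "$@" "$tmp"; then
            lo=$mid                        # still good, look further down
        else
            hi=$mid                        # already bad, look further up
        fi
    done
    rm -f "$tmp"
    echo "$hi"
}
```

The check itself is left open: it would load the prefix with bogoutil -l as above and then decide, by whatever test you trust, whether the resulting DB works. With such a function (say my_check, hypothetical), the call would be first_bad_line wordlist.txt my_check.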
And... I created a text file without the .ENCODING line.
...and it works.
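
A sketch of that workaround, for the record. The function name strip_encoding and the output file name are made up for illustration; the pattern assumes the token sits at the start of the line and is followed by whitespace, as in the dump line quoted above.

```shell
# Hypothetical helper: print a text dump with the .ENCODING row removed.
strip_encoding() {
    grep -v '^\.ENCODING[[:space:]]' "$1"
}

# Intended use, followed by the same reload as before:
#   strip_encoding wordlist.txt > wordlist-noenc.txt
#   rm wordlist.db
#   bogoutil -l wordlist.db < wordlist-noenc.txt
```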
So it looks like there's a problem when a trained mail contains one of
your reserved tokens? The .MSG_COUNT and .WORDLIST_VERSION values look
correct to me, so presumably those strings never occurred in training
mails.
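
To eyeball the dot-prefixed rows in a dump side by side, something like the following may help. It is only a sketch: the function name dot_tokens is hypothetical, it assumes the column order token, spam count, ham count, date (which matches how the lines above were read), and it treats every token starting with "." as a candidate, which may catch more than the reserved ones.

```shell
# Hypothetical helper: list every dump row whose token starts with "."
# and label the columns (assumed order: token, spam, ham, date).
dot_tokens() {
    awk '$1 ~ /^\./ { printf "%s spam=%s ham=%s date=%s\n", $1, $2, $3, $4 }' "$1"
}
```

Run as dot_tokens wordlist.txt.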