In 1.3.0.rc1, for ASCII (Windows-1252) emails, Bogofilter "hangs" on encoding labels

Thu Jun 12 05:59:53 CEST 2025

Matthias,

The theories in my previous email, about there being 
import/export/merging bugs with bogoutil, are most likely NOT 
correct. So please disregard that previous email (quoted at the 
end of this email), since it would likely send you in the wrong 
direction. (But my OTHER emails about OTHER issues are still valid!)

And there's still a problem here - just a different issue. So it turns 
out that, instead, for ASCII (Windows-1252) emails (and possibly 
others?), Bogo "hangs" on many types of encoding labels found in those 
emails. This was hard to troubleshoot because it seems like it has to 
have SOME kind of token (that's in the wordlist database) that is found 
inside that item (or line? or nearby?) for this error to occur. So the 
presence of these labels ALONE doesn't seem to trigger the error. For 
example, I took one of the messages that was having this problem, and 
then I ran that message against a freshly created Bogofilter database 
that was trained on only just a few spams/hams - and it THEN didn't 
produce this error.

So when checking messages using my regular wordlist database - a 
situation where these errors were consistently happening on certain 
emails - I then took some of the messages that were consistently having 
this error - and simply deleted all of the following types of strings in 
those emails, and then this error consistently went away (once these 
strings were removed!):

=?us-ascii?Q?
=?utf-8?B?
...etc - there are many others similar to this, that produced the same 
error - all various types of encoding directives.

So - AS A TEMPORARY WORKAROUND - I then changed the way that my apps 
that use Bogofilter check these messages by doing the following:

awk '{gsub(/=\?[A-Za-z0-9_-]{3,30}\?[BbQq]\?/, ""); print}' \
  /path-to-msg/msg-file-name | bogofilter -t
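
To show what the filter above actually removes, here is a minimal sketch of the same pattern as a reusable shell function. The function name strip_encoding_labels is made up for illustration. sed -E is swapped in for awk only on the assumption that some awk builds may not accept interval expressions such as {3,30}; that has not been verified here.

```shell
# Hypothetical helper: same pattern as the awk one-liner, using sed -E.
# It removes only the leading "=?charset?B?" / "=?charset?Q?" label and
# leaves the encoded payload and the trailing "?=" in place.
strip_encoding_labels() {
    sed -E 's/=\?[A-Za-z0-9_-]{3,30}\?[BbQq]\?//g'
}

# Sample header line taken from the examples in this email.
printf '%s\n' 'Subject: =?utf-8?Q?Legal_outsourcing_needs_=F0=9F=A4=9D=F0=9F=A4=9D?=' \
    | strip_encoding_labels
# prints: Subject: Legal_outsourcing_needs_=F0=9F=A4=9D=F0=9F=A4=9D?=
```

The output would then be piped to "bogofilter -t" in the same way as the awk version.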

While not perfect, this workaround helps a lot. But please look into this 
as this likely shouldn't happen. And the way it "hangs" is also not good 
- it seems to be stuck in an infinite loop, never returning back, nor 
giving an error.
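
Since the failure mode is a hang rather than an error, a caller can at least bound the wait until the underlying bug is fixed. This is a minimal sketch, assuming the GNU coreutils "timeout" command is available. The function name run_with_limit and the 30-second limit are hypothetical, not anything Bogofilter itself provides.

```shell
# Hypothetical wrapper: run a command with a time limit so that a hang
# cannot block the caller forever. Relies on GNU coreutils `timeout`,
# which exits with status 124 when the limit is reached.
run_with_limit() {
    limit_seconds=$1
    shift
    timeout "$limit_seconds" "$@"
    status=$?
    if [ "$status" -eq 124 ]; then
        echo "timed out after ${limit_seconds}s: $*" >&2
    fi
    return "$status"
}

# Example use (commented out so the sketch itself does not need bogofilter):
# run_with_limit 30 bogofilter -t < /path-to-msg/msg-file-name
```

A status of 124 then marks a message that hit the hang, which also makes it easy to collect such messages for a bug report.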

Also, most people use Bogofilter in a situation where the encoding of 
the emails is either UTF-8 or iso-8859-1, so maybe that explains why 
this bug was missed in testing?

EXAMPLES OF WHAT CAUSES THIS TO TRIGGER:
(again, these by themselves are likely not going to cause an issue - it 
seems to take one of these being present PLUS something on that same 
line being in the wordlist database)

X-MS-Exchange-AntiSpam-MessageData-0:

=?utf-8?B?SWVja3RaWklkUUVFcVJrK3R2VExMcmd4L3EyZkx3RjFpQVM3dkFHS2FVcVRo?=

=?utf-8?B?Nk04eDBDUmlMSUZBRFRhUXc3TUFzbS9teUJKN1RNaW9PZDh2Uk42Q0hWdWZR?=

=?utf-8?B?bkF6TUVXWkR5OU9xWDVKNWkvOHBvalBYSmVvdmRxN29CZnk4ekthU1RHV3Rj?=

=?utf-8?B?bVhaY0pRa0psanlhTHNyb01pWlpKV3NsODlFajVabHplSlF1UTFwU2ZUQ2dJ?=

filename="=?utf-8?B?8J+Siy0tX18yOSBZZWFycyBPbGRlciBNb23wn5iYTmVlZCBhIFJlZ3VsYXIg?=

Subject: =?utf-8?Q?Legal_outsourcing_needs_=F0=9F=A4=9D=F0=9F=A4=9D?=

x-ms-exchange-antispam-messagedata-0:

=?us-ascii?Q?N5e1vWvupvF6lDpb09cJzuLmQrdL3JbAD9aZlp6QJg8bNwlrEfOefIKL1ih4?=

=?us-ascii?Q?zfmiKtKM1ufFc2KLF0b+HD7jU9IQ79C7RRoohhexEGphLm+t+JTNPFZA4K55?=

=?us-ascii?Q?BZc0d0KUZ7uRPI/Wus3eZvqmnQ9TAzSdOeh2E1F4yNZZ9neeGzaEU0215ZBn?=

Those are excerpts from emails that had this issue, and that then worked 
when the "=?utf-8?B?" part (etc.) was removed.

I hope this helps and makes sense! Thanks again for all that you do for 
Bogofilter!

Rob McEwen, invaluement

------ Original Message ------
From "Rob McEwen" <rob at invaluement.com>
To "Matthias Andree" <matthias.andree at gmx.de>; bogofilter at bogofilter.org
Date 6/11/2025 1:11:54 AM
Subject Re: switching between different databases - in 1.3.0.rc1

>------ Original Message ------
>From "Matthias Andree via bogofilter" <bogofilter at bogofilter.org>
>
>>Oh, that's a surprise (for now anyways). I would not expect order-of-magnitude speed changes in the _database_ department. For lexer issues on pathological cases (esp. with long physical lines in HTML and certain other cases), yes, but for databases, that's unexpected. Maybe even outside bogofilter, and maybe it would be more useful to re-build 1.2.5 on your Debian 12 system to see. And then I haven't used Debian or derivatives such as Ubuntu for bogofilter in ages, so I don't know what else changed in distro policies, kernel versions, and whatnot. But if "newer is faster" without being less precise, we've gone in the right direction. The important part will be turning only one knob at a time.
>
>Matthias,
>
>I know I've already sent you some other info - and so I normally would 
>wait before sending you this - but I think this might be interrelated 
>to some of my other info - and I want to make sure that this gets fixed 
>before the next version. So regarding your statement above about the 
>faster exporting when using bogoutil - and as I had mentioned before, I 
>often do training on entire large batches of messages away from 
>production systems, then move the resulting database file to production 
>usage. So to speed things up, I recently tried splitting my messages 
>into multiple folders, and then I had multiple instances of Bogofilter 
>running in separate docker.io containers processing them, and this 
>MASSIVELY sped things up. So then the plan was to merge the individual 
>databases created, thus merging them back into one database using 
>these commands:
>
># wordlist1.db becomes the start of the new wordlist.db
>mv wordlist1.db wordlist.db
>bogoutil -d wordlist2.db | bogoutil -l wordlist.db
>bogoutil -d wordlist3.db | bogoutil -l wordlist.db
>bogoutil -d wordlist4.db | bogoutil -l wordlist.db
>
>So it was my understanding that bogoutil does this smartly and merges 
>duplicate tokens into one row, with the ham/spam counts merged, 
>correct? And so the idea is that this would end up in the SAME place as 
>if bogofilter had trained one-by-one, on the same things, with the same 
>settings, that these 4 example databases did, correct?
>
>So this optimization seemed promising - EXCEPT - AFTER this merging - 
>when just doing a scan ("bogofilter -t < ") many emails would just hang 
>and the process just locked up. My theory is that in the new version, 
>bogoutil simply missed getting some of the mods to the main bogofilter 
>program? (perhaps related to the handling of weird/exotic characters?) 
>But that's just a guess. It could be something else. But this is most 
>definitely a bug.
>
>If you want me to generate a small batch of messages and provide 
>examples you can replicate - let me know and I'll send that to you.
>
>Thanks again for all that you do!
>
>Rob McEwen, invaluement
>