In 1.3.0.rc1, for ASCII (Windows-1252) emails, Bogofilter "hangs" on encoding labels
Rob McEwen
rob at invaluement.com
Thu Jun 12 05:59:53 CEST 2025
Matthias,
In my previous email, my theories about there being
import/export/merging bugs with bogoutil - those are most likely NOT
correct theories. So please disregard that previous email (quoted at the
end of this email) since that will likely send you in the wrong
direction. (But my OTHER emails about OTHER issues are still valid!)
And there's still a problem here - just a different issue. So it turns
out that, instead, for ASCII (Windows-1252) emails (and possibly
others?), Bogo "hangs" on many types of encoding labels found in those
emails. This was hard to troubleshoot because it seems like it has to
have SOME kind of token (that's in the wordlist database) that is found
inside that item (or line? or nearby?) for this error to occur. So the
presence of these labels ALONE doesn't seem to trigger the error. For
example, I took one of the messages that was having this problem, and
then I ran that message against a freshly created Bogofilter database
that was trained on only just a few spams/hams - and it THEN didn't
produce this error.
So when checking messages using my regular wordlist database - a
situation where these errors were consistently happening on certain
emails - I then took some of the messages that were consistently having
this error - and simply deleted all of the following types of strings in
those emails, and then this error consistently went away (once these
strings were removed!):
=?us-ascii?Q?
=?utf-8?B?
...etc - there are many others similar to this, that produced the same
error - all various types of encoding directives.
So - AS A TEMPORARY WORKAROUND - I then changed the way that my apps
that use Bogofilter check these messages, by doing the following:
awk '{gsub(/=\?[A-Za-z0-9_-]{3,30}\?[BbQq]\?/, ""); print}' \
/path-to-msg/msg-file-name | bogofilter -t
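Some awk implementations (older mawk builds, for example) do not handle interval expressions such as {3,30}, in which case the filter above may not match anything. Below is a minimal sketch of the same filter using sed -E instead, assuming a sed with extended regular expression support (GNU and BSD sed both have it). The function name strip_encoding_labels is made up for illustration and is not part of Bogofilter.

```shell
# Hypothetical helper: delete RFC 2047-style opening labels such as
# "=?utf-8?B?" or "=?us-ascii?Q?" from every line of stdin.
# The encoded payload and the trailing "?=" are left in place,
# just as with the awk version.
strip_encoding_labels() {
    sed -E 's/=\?[A-Za-z0-9_-]{3,30}\?[BbQq]\?//g'
}

# Example on one of the headers quoted in this email:
printf '%s\n' 'Subject: =?utf-8?Q?Legal_outsourcing_needs_=F0=9F=A4=9D=F0=9F=A4=9D?=' \
    | strip_encoding_labels
# prints: Subject: Legal_outsourcing_needs_=F0=9F=A4=9D=F0=9F=A4=9D?=
```

It would be used the same way as the awk filter: strip_encoding_labels < /path-to-msg/msg-file-name | bogofilter -t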
While not perfect, this workaround helps a lot. But please look into this,
as this likely shouldn't happen. And the way it "hangs" is also not good
- it seems to be stuck in an infinite loop, never returning, and never
giving an error.
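Until the hang itself is fixed, a calling app can at least protect itself with a time limit. This is only a sketch, assuming timeout(1) from GNU coreutils is installed. The function name run_with_limit and the 30-second limit in the usage line are arbitrary choices for illustration.

```shell
# Hypothetical guard: run a command with a time limit so that a hung
# classifier cannot block the caller forever.
# Relies on timeout(1) from GNU coreutils, which exits with status 124
# when the limit is reached.
run_with_limit() {
    limit="$1"
    shift
    status=0
    timeout "$limit" "$@" || status=$?
    if [ "$status" -eq 124 ]; then
        echo "gave up after ${limit}s: $*" >&2
    fi
    return "$status"
}
```

Usage would look like: run_with_limit 30 bogofilter -t < /path-to-msg/msg-file-name. Since bogofilter's documented exit statuses are 0 to 3, a status of 124 should be distinguishable from a normal classification result.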
Also, most people use Bogofilter in a situation where the encoding of
the emails is either UTF-8 or iso-8859-1, so maybe that explains why
this bug was missed in testing?
EXAMPLES OF WHAT CAUSES THIS TO TRIGGER:
(again, these by themselves are likely not going to cause an issue - it
seems to take one of these being present PLUS something on that same
line being in the wordlist database)
X-MS-Exchange-AntiSpam-MessageData-0:
=?utf-8?B?SWVja3RaWklkUUVFcVJrK3R2VExMcmd4L3EyZkx3RjFpQVM3dkFHS2FVcVRo?=
=?utf-8?B?Nk04eDBDUmlMSUZBRFRhUXc3TUFzbS9teUJKN1RNaW9PZDh2Uk42Q0hWdWZR?=
=?utf-8?B?bkF6TUVXWkR5OU9xWDVKNWkvOHBvalBYSmVvdmRxN29CZnk4ekthU1RHV3Rj?=
=?utf-8?B?bVhaY0pRa0psanlhTHNyb01pWlpKV3NsODlFajVabHplSlF1UTFwU2ZUQ2dJ?=
filename="=?utf-8?B?8J+Siy0tX18yOSBZZWFycyBPbGRlciBNb23wn5iYTmVlZCBhIFJlZ3VsYXIg?=
Subject: =?utf-8?Q?Legal_outsourcing_needs_=F0=9F=A4=9D=F0=9F=A4=9D?=
x-ms-exchange-antispam-messagedata-0:
=?us-ascii?Q?N5e1vWvupvF6lDpb09cJzuLmQrdL3JbAD9aZlp6QJg8bNwlrEfOefIKL1ih4?=
=?us-ascii?Q?zfmiKtKM1ufFc2KLF0b+HD7jU9IQ79C7RRoohhexEGphLm+t+JTNPFZA4K55?=
=?us-ascii?Q?BZc0d0KUZ7uRPI/Wus3eZvqmnQ9TAzSdOeh2E1F4yNZZ9neeGzaEU0215ZBn?=
Those are excerpts from emails that had this issue, and that then worked
when the "=?utf-8?B?" part (etc.) was removed.
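To pick out candidate messages from a larger corpus, something like the following sketch could help. It only lists files that contain such a label, using the same pattern as the awk workaround. As noted above, the label alone does not seem to be enough to trigger the hang, so these are candidates only. It assumes a grep that supports -r (GNU and BSD grep do), and the function name list_labeled_messages is made up for illustration.

```shell
# Hypothetical helper: print the names of files under a directory that
# contain at least one encoded-word opening label ("=?charset?B?" or
# "=?charset?Q?").
list_labeled_messages() {
    grep -rlE '=\?[A-Za-z0-9_-]{3,30}\?[BbQq]\?' "$1"
}
```

For example, list_labeled_messages /path-to-msg would print one file name per line, and each of those could then be run through "bogofilter -t" to see which ones actually hang.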
I hope this helps and makes sense! Thanks again for all that you do for
Bogofilter!
Rob McEwen, invaluement
------ Original Message ------
From "Rob McEwen" <rob at invaluement.com>
To "Matthias Andree" <matthias.andree at gmx.de>; bogofilter at bogofilter.org
Date 6/11/2025 1:11:54 AM
Subject Re: switching between different databases - in 1.3.0.rc1
>------ Original Message ------
>From "Matthias Andree via bogofilter" <bogofilter at bogofilter.org>
>
>>Oh, that's a surprise (for now anyways). I would not expect order-of-magnitude speed changes in the _database_ department. For lexer issues on pathological cases (esp. with long physical lines in HTML and certain other cases), yes, but for databases, that's unexpected. Maybe even outside bogofilter, and maybe it would be more useful to re-build 1.2.5 on your Debian 12 system to see. And then I haven't used Debian or derivatives such as Ubuntu for bogofilter in ages, so I don't know what else changed in distro policies, kernel versions, and whatnot. But if "newer is faster" without being less precise, we've gone in the right direction. The important part will be turning only one knob at a time.
>
>Matthias,
>
>I know I've already sent you some other info - and so I normally would
>wait before sending you this - but I think this might be interrelated
>to some of my other info - and I want to make sure that this gets fixed
>before the next version. So regarding your statement above about the
>faster exporting when using bogoutil - and as I had mentioned before, I
>often do training on entire large batches of messages away from
>production systems, then move the resulting database file to production
>usage. So to speed things up, I recently tried splitting my messages
>into multiple folders, and then I had multiple instances of Bogofilter
>running in separate docker.io containers processing them, and this
>MASSIVELY sped things up. So then the plan was to merge the individual
>databases created back into one database using these commands:
>
>mv wordlist1.db wordlist.db # this becomes the start of the new wordlist.db
>bogoutil -d wordlist2.db | bogoutil -l wordlist.db
>bogoutil -d wordlist3.db | bogoutil -l wordlist.db
>bogoutil -d wordlist4.db | bogoutil -l wordlist.db
>
>So it was my understanding that bogoutil does this smartly and merges
>duplicate tokens into one row, with the ham/spam counts merged,
>correct? And so the idea is that this would end up in the SAME place as
>if bogofilter had trained one-by-one, on the same things, with the same
>settings, that these 4 example databases did, correct?
>
>So this optimization seemed promising - EXCEPT - AFTER this merging -
>when just doing a scan ("bogofilter -t < ") many emails would just hang
>and the process just locked up. My theory is that in the new version,
>bogoutil simply missed getting some of the mods to the main bogofilter
>program? (perhaps related to the handling of weird/exotic characters?)
>But that's just a guess. It could be something else. But this is most
>definitely a bug.
>
>If you want me to generate a small batch of messages and provide
>examples you can replicate - let me know and I'll send that to you.
>
>Thanks again for all that you do!
>
>Rob McEwen, invaluement
>