switching between different databases - in 1.3.0.rc1

Rob McEwen rob at invaluement.com
Wed Jun 11 07:11:54 CEST 2025


------ Original Message ------
>From "Matthias Andree via bogofilter" <bogofilter at bogofilter.org>

>Oh, that's a surprise (for now anyways). I would not expect order-of-magnitude speed changes in the _database_ department. For lexer issues on pathological cases (esp. with long physical lines in HTML and certain other cases), yes, but for databases, that's unexpected. Maybe even outside bogofilter, and maybe it would be more useful to re-build 1.2.5 on your Debian 12 system to see. And then I haven't used Debian or derivatives such as Ubuntu for bogofilter in ages, so I don't know what else changed in distro policies, kernel versions, and whatnot. But if "newer is faster" without being less precise, we've gone in the right direction. The important part will be turning only one knob at a time.

Matthias,

I know I've already sent you some other info, and normally I would wait 
before sending this, but I think it may be interrelated with that other 
info, and I want to make sure this gets fixed before the next version. 
This is regarding your statement above about the faster exporting when 
using bogoutil. As I had mentioned before, I often do training on 
entire large batches of messages away from production systems, then 
move the resulting database file to production usage. To speed things 
up, I recently tried splitting my messages into multiple folders and 
then running multiple instances of Bogofilter in separate docker.io 
containers to process them, and this MASSIVELY sped things up. (A 
rough sketch of that training step follows the merge commands below.) 
The plan was then to merge the individual databases that were created 
back into one database using these commands:

mv wordlist1.db wordlist.db   # this becomes the start of the new wordlist.db
bogoutil -d wordlist2.db | bogoutil -l wordlist.db
bogoutil -d wordlist3.db | bogoutil -l wordlist.db
bogoutil -d wordlist4.db | bogoutil -l wordlist.db
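
For context, the training half of that workflow looked roughly like 
this (just a sketch - the image name "bogofilter:1.3.0rc1" and the 
spamN/dbN paths are placeholders for whatever the real setup uses, and 
each folder is assumed to hold one message per file):

# run four containers in parallel, each training its own wordlist;
# "bogofilter:1.3.0rc1" is a placeholder for an image with bogofilter
# installed (-s registers spam; a ham folder would use -n instead)
for i in 1 2 3 4; do
    docker run --rm -d --name bogotrain$i \
        -v "$PWD/spam$i:/corpus" -v "$PWD/db$i:/db" \
        bogofilter:1.3.0rc1 \
        sh -c 'for m in /corpus/*; do bogofilter -d /db -s < "$m"; done'
done
# once every container has finished, each db$i/wordlist.db becomes
# wordlist$i.db and is merged as shown above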

So it was my understanding that bogoutil does this smartly, merging 
duplicate tokens into one row with the ham/spam counts summed - 
correct? And so the idea is that this would end up in the SAME place 
as if bogofilter had trained one-by-one, with the same settings, on 
the same messages that went into these four example databases - 
correct?
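
One way to sanity-check that assumption (just a sketch - 
"wordlist-single.db" here is a hypothetical reference copy built by 
training in a single pass on the same messages):

bogoutil -d wordlist.db        | sort > merged.txt
bogoutil -d wordlist-single.db | sort > single.txt
diff merged.txt single.txt
# apart from the .MSG_COUNT bookkeeping token and possibly the
# per-token dates, the two dumps should be identical if the merge
# really does sum the counts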

So this optimization seemed promising - EXCEPT that AFTER this 
merging, when just doing a scan ("bogofilter -t < "), many emails 
would simply hang and the process would lock up. My theory is that in 
the new version, bogoutil simply missed getting some of the 
modifications made to the main bogofilter program? (Perhaps related to 
the handling of weird/exotic characters?) But that's just a guess; it 
could be something else. Either way, this is most definitely a bug.
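
A rough way to isolate which messages trigger the hang (sketch only - 
"./batch" is a placeholder directory holding one message per file):

for f in ./batch/*; do
    timeout 30 bogofilter -t < "$f" > /dev/null 2>&1
    [ $? -eq 124 ] && echo "hangs: $f"
done
# GNU timeout exits with status 124 when it has to kill the command,
# so this prints only the messages that never finish classifying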

If you want me to generate a small batch of messages and provide 
examples you can use to replicate this, let me know and I'll send that 
to you.

Thanks again for all that you do!

Rob McEwen, invaluement

