switching between different databases - in 1.3.0.rc1
Rob McEwen
rob at invaluement.com
Wed Jun 11 07:11:54 CEST 2025
------ Original Message ------
>From "Matthias Andree via bogofilter" <bogofilter at bogofilter.org>
>Oh, that's a surprise (for now anyways). I would not expect order-of-magnitude speed changes in the _database_ department. For lexer issues on pathological cases (esp. with long physical lines in HTML and certain other cases), yes, but for databases, that's unexpected. Maybe even outside bogofilter, and maybe it would be more useful to re-build 1.2.5 on your Debian 12 system to see. And then I haven't used Debian or derivatives such as Ubuntu for bogofilter in ages, so I don't know what else changed in distro policies, kernel versions, and whatnot. But if "newer is faster" without being less precise, we've gone in the right direction. The important part will be turning only one knob at a time.
Matthias,
I know I've already sent you some other info, and normally I would wait
before sending this, but I think it may be interrelated with that other
info, and I want to make sure this gets fixed before the next version.
Regarding your statement above about the faster exporting with bogoutil:
as I mentioned before, I often train on entire large batches of messages
away from production systems, then move the resulting database file into
production use. To speed things up, I recently tried splitting my
messages into multiple folders and running multiple instances of
bogofilter in separate docker.io containers, one per folder, and this
MASSIVELY sped things up. The plan was then to merge the individual
databases back into one database using these commands:
mv wordlist1.db wordlist.db   # this becomes the start of the new wordlist.db
bogoutil -d wordlist2.db | bogoutil -l wordlist.db
bogoutil -d wordlist3.db | bogoutil -l wordlist.db
bogoutil -d wordlist4.db | bogoutil -l wordlist.db
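In case it helps, here's roughly how I'd script the same merge for more
than a few lists (just a sketch; the wordlistN.db names are placeholders
for my per-container databases, and the last line assumes the usual
.MSG_COUNT pseudo-token is present so the merged message totals can be
eyeballed):

mv wordlist1.db wordlist.db
for db in wordlist2.db wordlist3.db wordlist4.db; do
    bogoutil -d "$db" | bogoutil -l wordlist.db
done
bogoutil -d wordlist.db | grep '^\.MSG_COUNT'   # merged spam/ham message totals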
It was my understanding that bogoutil handles this smartly and merges
duplicate tokens into one row, adding their ham/spam counts together,
correct? And so the merged database should end up the SAME as if
bogofilter had trained one-by-one, with the same settings, on the same
messages that produced these 4 example databases, correct?
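One way I could sanity-check that assumption would be to also train a
single list sequentially on the same messages and compare dumps (a
sketch only; wordlist-seq.db is a hypothetical name for that reference
list, and the awk keeps just the token and the two count columns,
dropping the trailing timestamp, which will naturally differ between
runs):

bogoutil -d wordlist.db     | awk '{print $1, $2, $3}' | sort > merged.txt
bogoutil -d wordlist-seq.db | awk '{print $1, $2, $3}' | sort > sequential.txt
diff merged.txt sequential.txt   # empty output means the counts match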
This optimization seemed promising, EXCEPT that AFTER this merging, when
just doing a scan ("bogofilter -t < "), many emails would hang and the
process locked up. My theory is that in the new version, bogoutil simply
missed getting some of the changes made to the main bogofilter program
(perhaps related to the handling of weird/exotic characters)? But that's
just a guess. It could be something else. Either way, this is most
definitely a bug.
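If it would help in the meantime, this is how I'd go about isolating
which messages trigger the hang (a sketch; the testbatch/ folder and the
30-second limit are arbitrary placeholders):

for msg in testbatch/*; do
    timeout 30 bogofilter -t < "$msg" > /dev/null
    [ $? -eq 124 ] && echo "hangs: $msg"
done

timeout returns 124 only when it kills the command, so the normal
bogofilter exit codes (spam/ham/unsure) won't produce false positives.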
If you want me to generate a small batch of messages and provide
examples you can use to replicate this, let me know and I'll send them to you.
Thanks again for all that you do!
Rob McEwen, invaluement