switching between different databases - in 1.3.0.rc1

Matthias Andree matthias.andree at gmx.de
Tue Jun 3 22:44:19 CEST 2025


Am 29.05.25 um 20:22 schrieb Rob McEwen via bogofilter:
> From "Matthias Andree" <matthias.andree at gmx.de>
>> Should you decide to do any profiling/performance measurements and
>> you identify hot spots or I/O slowdowns somewhere, please share your
>> findings.
>
> Matthias,
>
> So as I was starting to do some performance testing, I noticed a few
> interesting things - and hopefully you'll deem some of my resulting
> suggestions worthy enough to be acted upon? ...perhaps even making it
> into RC2?
>
> One of the things that I've noticed is that RC1 extracts MUCH MORE
> additional types of tokens than 1.2.5 did - especially with certain
> types of emails - and that is overall EXCELLENT - but it does come
> with caveats/concerns. So I think this can potentially cause
> performance issues? But I'm STILL very glad for this additional and
> helpful data - so please don't remove or reverse it - I just think
> there are some helpful workarounds that might in SOME cases help
> mitigate such issues if/when they occur (and this might also lead to
> some performance optimizations?)


Rob, thanks for the two long messages on this subject.


Bogofilter 1.3.0.rc* fixed a truckload of bugs that were in 1.2.5.
There have been just shy of 200 commits since that older release.

We've had contributions lingering in SourceForge's bug tracker, and
there were several high-quality bug reports that resulted in fixes. I
rewrote the MIME *header* decoding (RFC 2047-style), which was recursing
when it should not have, and I also stopped some of the lexer rules from
greedily eating up entire chunks: a few of our regular expressions in
the .l file (flex source file) were not designed well, which I've
rectified. Of course, such bug fixes change the outcome, and that in
turn can change performance. I hope for the better.


>
> Here are some observations:
>
> (1) Even for a freshly-generated database where legit and spam emails
> from directories were processed, and the wordlist file was missing
> before doing that - even then - it's interesting how the two commands
> below significantly reduce the file size - and in some scenarios, in
> some cursory testing, this alone significantly improved performance on
> individual message scans. (That might be more due to the VPS not
> having sufficient RAM to begin with - though the system I tested this
> on should have had plenty of RAM to prevent that from being an issue -
> to be sure I probably should have tested with a VPS with an obscene
> amount of extra RAM and/or dived deep into the RAM usage, which I
> didn't do.) But, at the least, this by itself is worthy of further
> investigation: (a) how much does this reduce file size and RAM on a
> freshly-generated db, and (b) how much does this alone speed up
> message scans?
> bogoutil -d ~/.bogofilter/wordlist.db | bogoutil -l 
> ~/.bogofilter/wordlist_temp.db
> mv /root/.bogofilter/wordlist_temp.db /root/.bogofilter/wordlist.db
> (2) As a side note - this use of bogoutil for filtering/altering the
> data runs orders of magnitude faster in 1.3.0.rc1 than it did in
> 1.2.5. I didn't test it scientifically, but just from human
> experience, a comparable database that takes about 10 seconds to do
> this in 1.2.5 seems to do the conversion (above) in about half a
> second in 1.3.0.rc1 - EXCELLENT!


Oh, that's a surprise (for now, anyway). I would not expect
order-of-magnitude speed changes in the _database_ department. For lexer
issues on pathological cases (especially long physical lines in HTML and
certain other inputs), yes - but for databases, that's unexpected. The
cause may even lie outside bogofilter, so it might be more useful to
rebuild 1.2.5 on your Debian 12 system and compare. Then again, I
haven't used Debian or derivatives such as Ubuntu for bogofilter in
ages, so I don't know what else changed in distro policies, kernel
versions, and whatnot. But if "newer is faster" without being less
accurate, we've gone in the right direction. The important part is
turning only one knob at a time.

Message scanning with bogofilter or bogolexer did change on purpose. 
More tokens may make the database slower, more tokens in RAM might cause 
memory to get tighter, but avoiding stupid things in the lexer may make 
things considerably faster.


>
> (3) This extra data I mentioned in point 1 (the additional/new types
> of tokens collected) can massively bloat the size of the database -
> which then bloats RAM usage - and again, I'm seeing indications that
> this can have a significant performance impact (for whatever reasons;
> everyone's "mileage may vary", of course!). Therefore, I've found that
> removing all tokens from the database that didn't have at least 2+
> hits in at least one of the spam or legit categories helps to
> SIGNIFICANTLY reduce the size of the data overall. While I admit that
> this probably reduces effectiveness, there's a strong argument that
> the increased effectiveness of the new types of tokens now included in
> 1.3.0.rc1 more than makes up for the effectiveness lost by running
> these two commands (below), which greatly reduce the size of the
> database (and thus reduce RAM usage). Also, keep in mind that the
> faster/slower performance (depending on the situation) might have
> OTHER causes besides raw RAM usage!
> bogoutil -d ~/.bogofilter/wordlist.db | awk '$2 > 1 || $3 > 1' |
> bogoutil -l ~/.bogofilter/wordlist_temp.db
> mv /root/.bogofilter/wordlist_temp.db /root/.bogofilter/wordlist.db


bogoutil has a -c <number> option to do just what your awk did, without
using awk :-) - if you want. I think we used the term "singletons" for
such tokens in the earlier years of bogofilter.
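
From the top of my head - untested, and the cutoff semantics may differ
slightly from your awk (which keeps a token if either count exceeds 1),
so check bogoutil(1) before relying on it:

bogoutil -d ~/.bogofilter/wordlist.db | bogoutil -c 2 -l ~/.bogofilter/wordlist_temp.db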

Of course, rebuilding a smaller database with bogoutil -l may make for
better page fill and, through improved locality on your RAID, improve
performance if the database doesn't stay cached in RAM or come from a
fast SSD. Some databases (SQLite3, for one) have cleanup commands (such
as SQL's VACUUM) to compact the database after many data have expired
from it.
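
For instance, with the SQLite3 back end, compaction would be something
along these lines (assuming the wordlist sits at the usual path):

sqlite3 ~/.bogofilter/wordlist.db 'VACUUM;'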


>
> (4) So basically, in 1.2.5, one strategy I found effective was to have 
> a completely separate 2nd Bogofilter implementation which uses the 
> following (controversial?) settings:
>
> min-token-len=2
> max-token-len=48
> multi-token-count=3
> max-multi-token-len=48
>
> And this most definitely already bloats the size of the database in
> 1.2.5 (but not due to a bug - such bloating is supposed to occur with
> these settings!) - so I then used that same awk '$2 > 1 || $3 > 1'
> filtering strategy to combine this multi-token strategy with removing
> one-off "unique" hits. It was very successful. The problem? When
> combining this multi-word strategy with 1.3.0.rc1 - which also bloats
> the data with additional types of tokens that 1.2.5 never included -
> the bloating (and processing time and file sizes!) goes absolutely
> "parabolic" and sort of "multiplies" - and then bogofilter's
> processing of large message stores, which previously took something
> like 20 minutes, now appears to require potentially DAYS OF TIME - and
> probably 10s or 100s of gigabytes to get there - who knows - but
> during testing it was insane. So this multi-word strategy is simply
> unworkable on 1.3.0.rc1 (at least for most situations/setups). So I
> guess for now I'm stuck keeping this part of my filtering process on
> 1.2.5?


That's a good question. If you have a small but striking example of a
few messages where that multi-token strategy runs much worse in 1.3.0
than in 1.2.5, I'd be willing to have a look. Feel free to anonymize
headers if that doesn't change the outcome; I suppose many tokens will
come from message bodies anyway. HTML especially may yield many more
tokens.


>
> So related to this - while I absolutely do want the new Bogofilter to
> continue to include these new types of tokens it's finding in the new
> version - please don't remove that feature, it is overall an excellent
> improvement - I do have a feature request: please consider adding a
> setting or command-line option to OPTIONALLY revert that behavior -
> perhaps even based on levels, such as this:
>
> # TokenCollectionLvl 1 # similar to 1.2.5's collecting of tokens
> # TokenCollectionLvl 2 # a compromise between 1.2.5 and the new
> version's more aggressive collecting of tokens
> TokenCollectionLvl 3 # DEFAULT - uses the new version's collecting of
> additional types of tokens
>
> (+ command line option for that?)
>
> Or something like that?


If only it were that easy. We'd need to isolate which changes made that
huge difference for you and then see if we can make it an option. Those
"200 changes since 1.2.5" don't scare me too much; that can be tracked
down in something like 5 ... 20 builds.
Adding an option would likely be the easier part; finding out which of
the changes causes the many extra tokens is the hard one (see above for
info that might help me find out).
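
For the record, that estimate is just a bisection of the history; with
git it would look roughly like this (the tag names here are from memory
and may be spelled differently in the repository):

git bisect start
git bisect bad bogofilter-1.3.0.rc1   # token flood present
git bisect good bogofilter-1.2.5      # old behaviour
# build the tree git checks out, run it over the sample message,
# and mark each step until the offending commit is isolated:
git bisect good   # or: git bisect bad

With just under 200 commits that converges in about 8 steps (the base-2
logarithm of the commit count).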


>
> (5) btw - this applies to all versions - I'm finding that, during
> regular usage of Bogofilter - nothing exotic, so using the default
> min-token-len=3 setting - those tokens which start with "head" but are
> only 3 characters long overall cause too many False Positives to be
> worthwhile. But changing the setting to a 4-character minimum would
> then cause too many False Negatives. So I use that import/export to
> fix this, where I added this little gem:
>
> grep -avE '^head:.{3}\s'
>
> So then instead of the following:
>
> bogoutil -d ~/.bogofilter/wordlist.db | awk '$2 > 1 || $3 > 1' | 
> bogoutil -l  ~/.bogofilter/wordlist_temp.db
>
> I instead run:
>
> bogoutil -d ~/.bogofilter/wordlist.db | awk '$2 > 1 || $3 > 1' | grep 
> -avE '^head:.{3}\s' | bogoutil -l ~/.bogofilter/wordlist_temp.db


So do I understand you correctly that it would help to have different
minimum token lengths for headers and bodies? Because for the messages
you receive, your setup benefits from 3-character tokens in bodies, but
head:... tokens with just 3 characters mess things up?


>
> (6) Note that my use of the "-a" option in grep here might also be
> filtering some things out which in some cases really should be there?
> So that's another factor that probably deserves some extra research -
> I just wanted everyone who reads this email and might try these things
> to be aware of that and to research that part for themselves.


GNU grep's -a makes grep treat its input as text; without it, grep would
print something like "Binary file matches" instead of the actual lines,
so it's probably wise in that particular situation.



>
> (7) Finally, because my system just does a massive write when I
> periodically rebuild my bogofilter databases - so my system rarely
> does any writes to the production db, and when it does, it's a
> "one-writer" situation - THEREFORE, I've found that staying with
> Berkeley DB and compiling it with "--disable-transactions" so far
> seems to be the best option for my system. But I totally recognize
> that switching to SQLite3 with transactions is the best option for
> Bogofilter's default settings. So you made the best decisions - but I
> just wanted you to know that this is at least one vote in favor of
> keeping both the option to use Berkeley DB and the option for
> "--disable-transactions" when compiling - so please don't remove those
> options/features! Thanks! (btw, related to this, I think the
> "db_transaction=no" setting in the bogofilter.cf file has been
> disabled in the new version? That motivated me to mention these things
> here.)


That --disable-transactions at configure time is a sledgehammer. We have
a finer tool: you can just add --db-transaction=no when using bogoutil
-l, or when running bogofilter for the first time to register spam or
ham with no pre-existing database (that's specific to Berkeley DB use).
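
Using your earlier pipeline as the template, that would be:

bogoutil -d ~/.bogofilter/wordlist.db | bogoutil --db-transaction=no -l ~/.bogofilter/wordlist_temp.db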


Regards,
Matthias


