switching between different databases - in 1.3.0.rc1
Rob McEwen
rob at invaluement.com
Thu May 29 20:22:04 CEST 2025
>From "Matthias Andree" <matthias.andree at gmx.de>
>Should you decide to do anything of profiling/performance metrics and you identify hot spots or I/O slowdowns somewhere, please share your findings.
Matthias,
As I started doing some performance testing, I noticed a few interesting
things - and I hope you'll deem some of the resulting suggestions worthy
enough to be acted upon, perhaps even making it into RC2?
One of the things I've noticed is that RC1 extracts MANY MORE types of
tokens than 1.2.5 did - especially with certain types of emails - and
that is overall EXCELLENT - but it does come with caveats/concerns,
because I think it can potentially cause performance issues. I'm STILL
very glad for this additional and helpful data - so please don't remove
or reverse it - I just think there are some helpful workarounds that
might, in SOME cases, help mitigate those issues if/when they occur (and
this might also lead to some performance optimizations?)
Here are some observations:
(1) Even for a freshly-generated database - where legit and spam emails
were processed from directories and the wordlist file did not exist
beforehand - it's interesting how the two commands below significantly
reduce the file size, and in some scenarios, in cursory testing, this
alone significantly improved performance on individual message scans.
(That might be more due to the VPS not having sufficient RAM to begin
with - the system I tested this on should have had plenty of RAM to
prevent that from being an issue - but I probably should have tested
this on a VPS with an obscene amount of extra RAM and/or dug deeper into
the RAM usage to be sure, and I didn't do that.) But, at the least, this
by itself is worth further investigation: (a) how much does this reduce
file size and RAM on a freshly-generated db, and (b) how much does this
alone speed up message scans? (A rough way to measure both is sketched
right after the commands.)
bogoutil -d ~/.bogofilter/wordlist.db | bogoutil -l ~/.bogofilter/wordlist_temp.db
mv /root/.bogofilter/wordlist_temp.db /root/.bogofilter/wordlist.db
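For anyone wanting to quantify (a) and (b) for themselves, something like
the following is the rough measurement I have in mind - "sample.eml" is
just a placeholder for any test message, and timing a single scan this
way only gives a coarse number:
ls -lh ~/.bogofilter/wordlist.db                  # file size before/after the round-trip
bogoutil -d ~/.bogofilter/wordlist.db | wc -l     # token count (one token per dump line)
time bogofilter -v < sample.eml                   # coarse timing of a single message scan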
(2) As a side note, this use of bogoutil for filtering/altering the data
runs orders of magnitude faster in 1.3.0.rc1 than it did in 1.2.5. I
didn't test it scientifically, but just from human experience: a
comparable database that takes about 10 seconds to do this in 1.2.5
seems to do this conversion (above) in about half a second in 1.3.0.rc1
- EXCELLENT!
(3) The extra data I mentioned in point 1 (the additional/new types of
tokens collected) can massively bloat the size of the database, which
then bloats RAM usage - and again, I'm seeing indications that this can
potentially have a significant performance impact (for whatever reasons
- everyone's "mileage may vary", of course!). Therefore, I've found that
removing all tokens that don't have at least 2 hits in at least one of
the spam or legit categories helps to SIGNIFICANTLY reduce the size of
the data overall. While I admit that this probably reduces effectiveness
somewhat, there's a strong argument that the increased effectiveness of
the new types of tokens now included in 1.3.0.rc1 more than makes up for
the reduction caused by running the two commands below, which greatly
shrink the database (and thus reduce RAM usage). Also, keep in mind that
the faster/slower performance (depending on the situation) might have
OTHER causes besides raw RAM usage!
bogoutil -d ~/.bogofilter/wordlist.db | awk '$2 > 1 || $3 > 1' | bogoutil -l ~/.bogofilter/wordlist_temp.db
mv /root/.bogofilter/wordlist_temp.db /root/.bogofilter/wordlist.db
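For anyone trying this, a quick sanity check before overwriting the
production file is to compare how many tokens the filter keeps versus
the total - this is just counting lines in the dump, nothing more:
bogoutil -d ~/.bogofilter/wordlist.db | wc -l                            # total tokens
bogoutil -d ~/.bogofilter/wordlist.db | awk '$2 > 1 || $3 > 1' | wc -l   # tokens that would be kept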
(4) So basically, in 1.2.5, one strategy I found effective was to have a
completely separate 2nd Bogofilter implementation which uses the
following (controversial?) settings:
min-token-len=2
max-token-len=48
multi-token-count=3
max-multi-token-len=48
And this most definitely already bloats the size of the database in
1.2.5 (not due to a bug - such bloating is supposed to occur with these
settings!) - so I then used that same awk '$2 > 1 || $3 > 1' filtering
strategy to combine this multi-token approach with removing one-off
"unique" hits. It was very successful. The problem? When combining this
multi-word strategy with 1.3.0.rc1 - which also bloats the data with
additional types of tokens that 1.2.5 never included - the bloating (and
processing time and file sizes!) goes absolutely "parabolic" and sort of
multiplies. Processing a large message store that previously took
something like 20 minutes now appears to require potentially DAYS OF
TIME - and probably tens or hundreds of gigabytes to get there - who
knows - but during testing it was insane. So this multi-word strategy is
simply unworkable on 1.3.0.rc1 (at least for most situations/setups). So
I guess for now I'm stuck keeping this part of my filtering process on
1.2.5?
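For context, here is roughly how I keep that 2nd implementation separate
from the main one - the directory and message file names below are just
examples; -d points bogofilter at an alternate wordlist directory and -c
at the alternate config holding the settings above:
bogofilter -c ~/.bogofilter-multitoken/bogofilter.cf -d ~/.bogofilter-multitoken -s < spam-sample.eml      # train spam into the 2nd wordlist
bogofilter -c ~/.bogofilter-multitoken/bogofilter.cf -d ~/.bogofilter-multitoken -n < legit-sample.eml     # train legit mail into the 2nd wordlist
bogofilter -c ~/.bogofilter-multitoken/bogofilter.cf -d ~/.bogofilter-multitoken -v < incoming-sample.eml  # classify against the 2nd wordlist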
Related to this: while I absolutely do want the new Bogofilter to
continue to include these new types of tokens it's finding - please
don't remove that feature, it's an excellent improvement overall - I do
have a feature request. Please consider adding a setting or command-line
option to OPTIONALLY revert that behavior, perhaps even based on levels,
such as this:
# TokenCollectionLvl 1 # similar to 1.2.5's collecting of tokens
# TokenCollectionLvl 2 # a compromise between 1.2.5 and the new version's more aggressive collecting of tokens
TokenCollectionLvl 3 # DEFAULT - the new version's collecting of additional types of tokens
(+ a command-line option for the same?)
Or something like that?
(5) Btw, this applies to all versions: I'm finding that, during regular
usage of Bogofilter (nothing exotic, so using the default
min-token-len=3 setting), those "head:" tokens whose token part is only
3 characters long cause too many False Positives overall to be
worthwhile. But raising the setting to a minimum length of 4 would then
cause too many False Negatives. So I use that export/import to fix this,
where I added this little gem:
grep -avE '^head:.{3}\s'
So then, instead of the following:
bogoutil -d ~/.bogofilter/wordlist.db | awk '$2 > 1 || $3 > 1' | bogoutil -l ~/.bogofilter/wordlist_temp.db
I instead run:
bogoutil -d ~/.bogofilter/wordlist.db | awk '$2 > 1 || $3 > 1' | grep -avE '^head:.{3}\s' | bogoutil -l ~/.bogofilter/wordlist_temp.db
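(If you want to see how many tokens that grep actually strips before
committing to it, counting the matches on the dump works - this is just
a count, it doesn't change anything:)
bogoutil -d ~/.bogofilter/wordlist.db | grep -acE '^head:.{3}\s'   # 3-character head: tokens that would be removed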
(6) Note that my use of the "-a" option in grep here might also be
filtering some things out that, in some cases, really should stay - so
that's another factor that probably deserves some extra research. I just
want everyone who reads this email and might try these things to be
aware of that, and to be sure to research that part for themselves.
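(One rough way to inspect what "-a" is keeping in play is to list the
dump lines that contain non-printable bytes - this only previews them,
it doesn't filter anything:)
bogoutil -d ~/.bogofilter/wordlist.db | grep -a '[^[:print:][:space:]]' | head   # sample of tokens containing non-text bytes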
(7) Finally, because my system just does one massive write when I
periodically rebuild my bogofilter databases, and otherwise rarely
writes to the production db - and when it does, it's a "one-writer"
situation - I've found that staying with Berkeley DB and compiling with
"--disable-transactions" seems, so far, to be the best option for my
system. But I totally recognize that switching to Sqlite3 with
transactions is the best option for Bogofilter's default settings. So
you made the best decisions - I just wanted you to know that this is at
least one vote in favor of keeping both the option to use Berkeley DB
and the "--disable-transactions" option when compiling - so please don't
remove those options/features! Thanks! (Btw, related to this, I think
the "db_transaction=no" setting in the bogofilter.cf file has been
disabled in the new version? That's what motivated me to mention these
things here.)
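For what it's worth, my build looks roughly like the following - I
believe the database-selection flag is --with-database=db on the current
source tree, but check ./configure --help on your version before relying
on it:
./configure --with-database=db --disable-transactions   # Berkeley DB backend, transactions off
make && make install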
PS - when I say "RAM usage" - to be technical, since Bogofilter doesn't
run as a service or daemon, that mostly translates to the system's
ability to FULLY cache the Bogofilter data in RAM (so please, no
lectures on that from anyone!)
I hope this all helps and makes sense. Thanks again for all you do for
Bogofilter!
Rob McEwen, invaluement