switching between different databases - in 1.3.0.rc1
Rob McEwen
rob at invaluement.com
Thu May 29 20:22:04 CEST 2025
>From "Matthias Andree" <matthias.andree at gmx.de>
>Should you decide to do anything of profiling/performance metrics and you identify hot spots or I/O slowdowns somewhere, please share your findings.
Matthias,
As I started doing some performance testing, I noticed a few interesting
things - and I hope you'll deem some of the resulting suggestions worthy
enough to be acted upon, perhaps even making it into RC2?
One of the things I've noticed is that RC1 extracts MANY MORE types of
tokens than 1.2.5 did - especially with certain types of emails - and
that is overall EXCELLENT - but it does come with caveats/concerns,
because I think it can potentially cause performance issues. I'm STILL
very glad for this additional and helpful data - so please don't remove
or reverse it - I just think there are some helpful workarounds that
might, in SOME cases, help mitigate those issues if/when they occur (and
this might also lead to some performance optimizations?)
Here are some observations:
(1) Even for a freshly-generated database - where legit and spam emails
were processed from directories and the wordlist file did not exist
beforehand - it's interesting how the two commands below significantly
reduce the file size, and in some scenarios, in cursory testing, this
alone significantly improved performance on individual message scans.
(That might be more due to the VPS not having sufficient RAM to begin
with - the system I tested this on should have had plenty of RAM to
prevent that from being an issue - but I probably should have tested
this on a VPS with an obscene amount of extra RAM and/or dug deeper into
the RAM usage to be sure, and I didn't do that.) But, at the least, this
by itself is worth further investigation: (a) how much does this reduce
file size and RAM on a freshly-generated db, and (b) how much does this
alone speed up message scans? (A rough way to measure both is sketched
right after the commands.)
bogoutil -d ~/.bogofilter/wordlist.db | bogoutil -l ~/.bogofilter/wordlist_temp.db
mv /root/.bogofilter/wordlist_temp.db /root/.bogofilter/wordlist.db
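For anyone wanting to quantify (a) and (b) for themselves, something like
the following is the rough measurement I have in mind - "sample.eml" is
just a placeholder for any test message, and timing a single scan this
way only gives a coarse number:
ls -lh ~/.bogofilter/wordlist.db                  # file size before/after the round-trip
bogoutil -d ~/.bogofilter/wordlist.db | wc -l     # token count (one token per dump line)
time bogofilter -v < sample.eml                   # coarse timing of a single message scan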
(2) As a side note, this use of bogoutil for filtering/altering the data
runs orders of magnitude faster in 1.3.0.rc1 than it did in 1.2.5. I
didn't test it scientifically, but just from human experience: a
comparable database that takes about 10 seconds to do this in 1.2.5
seems to do this conversion (above) in about half a second in 1.3.0.rc1
- EXCELLENT!
(3) The extra data I mentioned in point 1 (the additional/new types of
tokens collected) can massively bloat the size of the database, which
then bloats RAM usage - and again, I'm seeing indications that this can
potentially have a significant performance impact (for whatever reasons
- everyone's "mileage may vary", of course!). Therefore, I've found that
removing all tokens that don't have at least 2 hits in at least one of
the spam or legit categories helps to SIGNIFICANTLY reduce the size of
the data overall. While I admit that this probably reduces effectiveness
somewhat, there's a strong argument that the increased effectiveness of
the new types of tokens now included in 1.3.0.rc1 more than makes up for
the reduction caused by running the two commands below, which greatly
shrink the database (and thus reduce RAM usage). Also, keep in mind that
the faster/slower performance (depending on the situation) might have
OTHER causes besides raw RAM usage!
bogoutil -d ~/.bogofilter/wordlist.db | awk '$2 > 1 || $3 > 1' | bogoutil -l ~/.bogofilter/wordlist_temp.db
mv /root/.bogofilter/wordlist_temp.db /root/.bogofilter/wordlist.db
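For anyone trying this, a quick sanity check before overwriting the
production file is to compare how many tokens the filter keeps versus
the total - this is just counting lines in the dump, nothing more:
bogoutil -d ~/.bogofilter/wordlist.db | wc -l                            # total tokens
bogoutil -d ~/.bogofilter/wordlist.db | awk '$2 > 1 || $3 > 1' | wc -l   # tokens that would be kept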
(4) So basically, in 1.2.5, one strategy I found effective was to have a
completely separate 2nd Bogofilter implementation which uses the
following (controversial?) settings:
min-token-len=2
max-token-len=48
multi-token-count=3
max-multi-token-len=48
And this most definitely already bloats the size of the database in
1.2.5 (not due to a bug - such bloating is supposed to occur with these
settings!) - so I then used that same awk '$2 > 1 || $3 > 1' filtering
strategy to combine this multi-token approach with removing one-off
"unique" hits. It was very successful. The problem? When combining this
multi-word strategy with 1.3.0.rc1 - which also bloats the data with
additional types of tokens that 1.2.5 never included - the bloating (and
processing time and file sizes!) goes absolutely "parabolic" and sort of
multiplies. Processing a large message store that previously took
something like 20 minutes now appears to require potentially DAYS OF
TIME - and probably tens or hundreds of gigabytes to get there - who
knows - but during testing it was insane. So this multi-word strategy is
simply unworkable on 1.3.0.rc1 (at least for most situations/setups). So
I guess for now I'm stuck keeping this part of my filtering process on
1.2.5?
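For context, here is roughly how I keep that 2nd implementation separate
from the main one - the directory and message file names below are just
examples; -d points bogofilter at an alternate wordlist directory and -c
at the alternate config holding the settings above:
bogofilter -c ~/.bogofilter-multitoken/bogofilter.cf -d ~/.bogofilter-multitoken -s < spam-sample.eml      # train spam into the 2nd wordlist
bogofilter -c ~/.bogofilter-multitoken/bogofilter.cf -d ~/.bogofilter-multitoken -n < legit-sample.eml     # train legit mail into the 2nd wordlist
bogofilter -c ~/.bogofilter-multitoken/bogofilter.cf -d ~/.bogofilter-multitoken -v < incoming-sample.eml  # classify against the 2nd wordlist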
Related to this: while I absolutely do want the new Bogofilter to
continue to include these new types of tokens it's finding - please
don't remove that feature, it's an excellent improvement overall - I do
have a feature request. Please consider adding a setting or command-line
option to OPTIONALLY revert that behavior, perhaps even based on levels,
such as this:
# TokenCollectionLvl 1 # similar to 1.2.5's collecting of tokens
# TokenCollectionLvl 2 # a compromise between 1.2.5 and the new version's more aggressive collecting of tokens
TokenCollectionLvl 3 # DEFAULT - the new version's collecting of additional types of tokens
(+ a command-line option for the same?)
Or something like that?
(5) Btw, this applies to all versions: I'm finding that, during regular
usage of Bogofilter (nothing exotic, so using the default
min-token-len=3 setting), those "head:" tokens whose token part is only
3 characters long cause too many False Positives overall to be
worthwhile. But raising the setting to a minimum length of 4 would then
cause too many False Negatives. So I use that export/import to fix this,
where I added this little gem:
grep -avE '^head:.{3}\s'
So then, instead of the following:
bogoutil -d ~/.bogofilter/wordlist.db | awk '$2 > 1 || $3 > 1' | bogoutil -l ~/.bogofilter/wordlist_temp.db
I instead run:
bogoutil -d ~/.bogofilter/wordlist.db | awk '$2 > 1 || $3 > 1' | grep -avE '^head:.{3}\s' | bogoutil -l ~/.bogofilter/wordlist_temp.db
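(If you want to see how many tokens that grep actually strips before
committing to it, counting the matches on the dump works - this is just
a count, it doesn't change anything:)
bogoutil -d ~/.bogofilter/wordlist.db | grep -acE '^head:.{3}\s'   # 3-character head: tokens that would be removed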
(6) Note that my use of the "-a" option in grep here might also be
filtering some things out that, in some cases, really should stay - so
that's another factor that probably deserves some extra research. I just
want everyone who reads this email and might try these things to be
aware of that, and to be sure to research that part for themselves.
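(One rough way to inspect what "-a" is keeping in play is to list the
dump lines that contain non-printable bytes - this only previews them,
it doesn't filter anything:)
bogoutil -d ~/.bogofilter/wordlist.db | grep -a '[^[:print:][:space:]]' | head   # sample of tokens containing non-text bytes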
(7) Finally, because my system just does one massive write when I
periodically rebuild my bogofilter databases, and otherwise rarely
writes to the production db - and when it does, it's a "one-writer"
situation - I've found that staying with Berkeley DB and compiling with
"--disable-transactions" seems, so far, to be the best option for my
system. But I totally recognize that switching to Sqlite3 with
transactions is the best option for Bogofilter's default settings. So
you made the best decisions - I just wanted you to know that this is at
least one vote in favor of keeping both the option to use Berkeley DB
and the "--disable-transactions" option when compiling - so please don't
remove those options/features! Thanks! (Btw, related to this, I think
the "db_transaction=no" setting in the bogofilter.cf file has been
disabled in the new version? That's what motivated me to mention these
things here.)
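For what it's worth, my build looks roughly like the following - I
believe the database-selection flag is --with-database=db on the current
source tree, but check ./configure --help on your version before relying
on it:
./configure --with-database=db --disable-transactions   # Berkeley DB backend, transactions off
make && make install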
PS - when I say "RAM usage" - to be technical, since Bogofilter doesn't
run as a service or daemon, that mostly translates to the system's
ability to FULLY cache the Bogofilter data in RAM (so please, no
lectures on that from anyone!)
I hope this all helps and makes sense. Thanks again for all you do for
Bogofilter!
Rob McEwen, invaluement