merge wordlist databases

BCDVTV at bitinfo.hu BCDVTV at bitinfo.hu
Fri Jan 24 11:36:39 CET 2025


Hi,

I'm planning to run bogofilter on multiple machines against the same set of emails, i.e. one mailbox. The user sometimes finds spam on one machine and classifies it there, and sometimes does this on another one, with each machine maintaining its own wordlist database. Connectivity between the nodes is occasional, not guaranteed to be always on. It's also possible for the user to handle the same spam email on both machines.

So I'm looking for a method to merge independently changing wordlist databases.
I use the sqlite backend, BTW.

I found Christos Chatzaras's (chris at cretaforce.gr) thread on this mailing list from Sun Oct 1 11:58:07 CEST 2017.
So I tested what bogoutil --dump and --load do: I dumped 2 wordlists and merged the dumps into 1,
then looked up 2 random words: one (Zoll-Fettreifen) with the same number attached to it in both wordlists (I don't know exactly what this number is for, maybe a simple counter), and another (ZzC5YWj.png) with different numbers across the 2 original wordlists.

$ grep Zoll-Fettreifen merged bogodump-1 bogodump-2
merged:Zoll-Fettreifen 2 0 20250124
bogodump-1:Zoll-Fettreifen 1 0 20250108
bogodump-2:Zoll-Fettreifen 1 0 20250124

$ grep ZzC5YWj.png merged bogodump-1 bogodump-2
merged:ZzC5YWj.png 17 0 20250124
bogodump-1:ZzC5YWj.png 6 0 20250123
bogodump-2:ZzC5YWj.png 11 0 20250124


So it adds up the numbers and takes the latest timestamp. (Also, why is the timestamp only at day precision?)
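
A minimal sketch of the behaviour observed above, in Python. It only models the result seen in the grep output and is not bogoutil's actual code. The names parse_line and merge_dumps are made up for illustration, and the line layout (token, two numbers, a YYYYMMDD date) is assumed from the dump lines shown.

```python
def parse_line(line):
    # Assumed layout, taken from the grep output: token, two numbers, YYYYMMDD date.
    token, first, second, date = line.rsplit(None, 3)
    return token, int(first), int(second), date


def merge_dumps(*dumps):
    """Sum both numbers per token and keep the latest date (hypothetical helper)."""
    merged = {}
    for dump in dumps:
        for line in dump:
            if not line.strip():
                continue  # skip blank lines
            token, first, second, date = parse_line(line)
            old = merged.get(token, (0, 0, "00000000"))
            # YYYYMMDD strings compare correctly as plain strings.
            merged[token] = (old[0] + first, old[1] + second, max(old[2], date))
    return merged


dump1 = ["Zoll-Fettreifen 1 0 20250108", "ZzC5YWj.png 6 0 20250123"]
dump2 = ["Zoll-Fettreifen 1 0 20250124", "ZzC5YWj.png 11 0 20250124"]

merged = merge_dumps(dump1, dump2)
for token, (first, second, date) in merged.items():
    print(token, first, second, date)
```

With the two dumps above, this gives the same lines as the "merged" file: 2 0 20250124 for Zoll-Fettreifen and 17 0 20250124 for ZzC5YWj.png.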

My question is: how accurate does the merged database become? Why are the numbers added up? What if "Zoll-Fettreifen" got into both wordlists from the same source? Wouldn't that distort the scores, as if such words had occurred twice as often? I guess the "correct" counter would be 1 in that case.
Maybe bogoutil --load assumes that each input covers a disjoint set of emails?

Would it suit my case better to maintain N "local" wordlists, each holding data about a unique set of emails reported as spam by the user, plus a "global" merged wordlist into which the local wordlists from the other nodes get loaded once in a while, and to use the global one when auto-classifying new emails (bogofilter -T)?
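
A rough sketch of this local/global idea, in Python, assuming the global wordlist is rebuilt from scratch from each node's latest local counts instead of loading a full local dump on top of an existing global one. The names rebuild_global and latest are hypothetical, and only one counter per token is kept for brevity.

```python
def rebuild_global(local_counts_by_node):
    """Rebuild the global counts from each node's latest local counts (hypothetical helper)."""
    global_counts = {}
    for counts in local_counts_by_node.values():
        for token, n in counts.items():
            global_counts[token] = global_counts.get(token, 0) + n
    return global_counts


# Latest local counts known for each node (made-up node names).
latest = {
    "node-a": {"Zoll-Fettreifen": 1},
    "node-b": {"Zoll-Fettreifen": 1},
}
first = rebuild_global(latest)

# node-b later classifies one more spam containing the token and syncs again.
latest["node-b"] = {"Zoll-Fettreifen": 2}
second = rebuild_global(latest)

# For contrast: loading node-b's full new dump on top of the old global
# would count node-b's earlier data a second time.
naive = first["Zoll-Fettreifen"] + latest["node-b"]["Zoll-Fettreifen"]

print(first, second, naive)
```

Rebuilding avoids counting a node's older data again on every sync, but it does not help when the same spam email was classified on both nodes: that email still contributes to both local wordlists.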

Or should I just sync the spam folder itself and feed the new emails in it to "bogofilter -s" (as those are marked as spam by the user)? I'm not very keen on doing this, because I don't want spam to "live" past the moment it is moved to the trash, let alone take up network bandwidth.

Thanks for reading.

--
András

