bogofilter databases

Tue Jul 15 23:14:52 CEST 2003

Greetings,

Life is interesting!  When I left on vacation a week or so back, bogofilter 
was quietly stable, the mailing list was virtually dormant, and some 
database work was at the top of my project list.  While I was gone, there 
was a flurry of activity, sparked by Gyepi's addition of tdb (tridge's tiny 
database) support and Greg's interest in cdb (djb's constant database).  If 
my vacations spur further bogofilter activity, I'll gladly take more 
of them!

As you all know, bogofilter currently uses BerkeleyDB to manage separate 
wordlists for spam and ham tokens.  These are the well-known spamlist.db 
and goodlist.db files.  Likely you'll recall that Greg and I did some work 
to create a version of bogofilter that uses a single, combined wordlist for 
storing all tokens.

In the multiple (separate) wordlist version, each token is stored in the 
appropriate (ham or spam) wordlist along with its count and (optionally) a 
timestamp.   Tokens which occur in both ham and spam messages are in both 
wordlists (thus duplicating some disk usage).

In the single (combined) wordlist version, each token appears in the 
wordlist once, along with two counts (for ham and spam), and (optionally) 
the timestamp.
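
To make the two layouts concrete, here is a minimal C sketch of what the 
stored value for one token might look like in each version.  The struct 
names, field names and 32-bit field widths are assumptions made for 
illustration and are not taken from the bogofilter source.  Keeping the 
newer of the two timestamps when merging is likewise just one plausible 
choice.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical value stored per token in spamlist.db or goodlist.db. */
struct sep_value {
    uint32_t count;   /* occurrences of the token in this list */
    uint32_t date;    /* optional timestamp */
};

/* Hypothetical value stored per token in the single combined wordlist. */
struct comb_value {
    uint32_t good;    /* ham count */
    uint32_t spam;    /* spam count */
    uint32_t date;    /* optional timestamp */
};

/*
 * Fold the per-list records for one token into a combined record.
 * A NULL pointer means the token is absent from that list.
 * Assumption: the newer of the two timestamps is kept.
 */
struct comb_value combine(const struct sep_value *good,
                          const struct sep_value *spam)
{
    struct comb_value out = { 0, 0, 0 };

    assert(good != NULL || spam != NULL);
    if (good != NULL) {
        out.good = good->count;
        out.date = good->date;
    }
    if (spam != NULL) {
        out.spam = spam->count;
        if (spam->date > out.date)
            out.date = spam->date;
    }
    return out;
}
```

A token seen only in spam would be passed as combine(NULL, &spam_rec) and 
ends up with a ham count of zero.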

The major advantage of single wordlist bogofilter is that only one wordlist 
needs to be searched for each token.  That means one database lookup per 
token instead of two, which speeds up bogofilter.

The major disadvantage is that BerkeleyDB's performance is closely tied to 
the size of its cache.  The default cache size works well for multiple 
wordlist bogofilter, but not for single wordlist bogofilter.  A cache of 
several megabytes provides the needed space, at the expense of some RAM.

The disk space needed for the two database versions is comparable.  It 
seems that relatively few tokens appear in both ham and spam, so the space 
needed for the ham and spam counts pretty much uses up the space that is 
saved by combining ham and spam tokens.
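
A rough back-of-the-envelope model shows why the two sizes come out close.  
The sketch below is an illustration only: it ignores BerkeleyDB's page and 
index overhead entirely, and the record sizes (an average key length plus 
4-byte counts and a 4-byte timestamp) are assumptions, not measurements.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed field sizes in bytes; not taken from the bogofilter source. */
enum { COUNT_BYTES = 4, DATE_BYTES = 4 };

/*
 * Separate wordlists: a token seen only in ham or only in spam is stored
 * once; a token seen in both is stored twice, once per list.
 */
size_t separate_bytes(size_t ham_only, size_t spam_only, size_t both,
                      size_t key_bytes)
{
    size_t rec = key_bytes + COUNT_BYTES + DATE_BYTES;

    assert(key_bytes > 0);
    return (ham_only + spam_only + 2 * both) * rec;
}

/*
 * Combined wordlist: every token is stored once, but each record
 * carries two counts instead of one.
 */
size_t combined_bytes(size_t ham_only, size_t spam_only, size_t both,
                      size_t key_bytes)
{
    size_t rec = key_bytes + 2 * COUNT_BYTES + DATE_BYTES;

    assert(key_bytes > 0);
    return (ham_only + spam_only + both) * rec;
}
```

With made-up figures of 40000 ham-only tokens, 50000 spam-only tokens, 
10000 shared tokens and 8-byte keys, this model gives 1760000 bytes for 
separate wordlists and 2000000 bytes for the combined one.  The combined 
layout saves one whole record per shared token but pays for an extra count 
on every token, so the totals stay in the same ballpark when the overlap 
is small.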

Anyhow, enough history - time for some status info.

My development version of bogofilter can operate in either single wordlist 
or multiple wordlist mode.  The value of a single global variable 
determines the mode.  The cvs repository will be updated with this code in 
the next day or so.

The single/multiple wordlist code has some rough edges.  It works, but 
doesn't seem polished.  Suggestions for improving it will be welcomed :-)

After the cvs update, the next step is for bogofilter to check whether 
BOGOFILTER_DIR contains 1 wordlist or 2 wordlists and to operate in the 
appropriate mode.
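
One way that check might look is sketched below.  This is not the code 
that will go into cvs.  The filename wordlist.db for the combined wordlist 
is an assumption, as are the names detect_mode(), have_file() and the 
wl_mode enum, and so is the choice to prefer separate mode when both kinds 
of file are present.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>

enum wl_mode { WL_NONE, WL_COMBINED, WL_SEPARATE };

/* Return nonzero if dir/name exists and is a regular file. */
static int have_file(const char *dir, const char *name)
{
    char path[4096];
    struct stat st;
    int n = snprintf(path, sizeof path, "%s/%s", dir, name);

    if (n < 0 || (size_t)n >= sizeof path)
        return 0;                     /* path too long: treat as missing */
    return stat(path, &st) == 0 && S_ISREG(st.st_mode);
}

/* Decide the operating mode from the wordlist files found in dir. */
enum wl_mode detect_mode(const char *dir)
{
    assert(dir != NULL);
    if (have_file(dir, "spamlist.db") && have_file(dir, "goodlist.db"))
        return WL_SEPARATE;           /* two wordlists */
    if (have_file(dir, "wordlist.db"))    /* assumed filename */
        return WL_COMBINED;           /* one wordlist */
    return WL_NONE;                   /* nothing usable found */
}
```

A caller would pass in the directory named by BOGOFILTER_DIR, for example 
the result of getenv("BOGOFILTER_DIR") after checking it for NULL.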

After that comes the merge of Gyepi's structural changes (triggered by his 
tdb work), the inclusion of the tdb code, and the release of 
bogofilter-0.14.0.  If the cdb work is ready, that will also be merged in.

... and that's all I currently know about bogofilter and databases.

David