bogofilter databases
David Relson
relson at osagesoftware.com
Tue Jul 15 23:14:52 CEST 2003
Greetings,
Life is interesting! When I left on vacation a week or so back, bogofilter
was quietly stable, the mailing list was virtually dormant, and some
database work was at the top of my project list. While I was gone, there
was a flurry of activity, sparked by Gyepi's addition of tdb (tridge's tiny
database) support and Greg's interest in cdb (djb's constant database). If
my vacations will spur further bogofilter activity, I'll gladly take more
of them!
As you all know, bogofilter currently uses BerkeleyDB to manage separate
wordlists for spam and ham tokens. These are the well-known spamlist.db
and goodlist.db files. You'll likely recall that Greg and I did some work
to create a version of bogofilter that uses a single, combined wordlist for
storing all tokens.
In the multiple (separate) wordlist version, each token is stored in the
appropriate (ham or spam) wordlist along with its count and (optionally) a
timestamp. Tokens which occur in both ham and spam messages are in both
wordlists (thus duplicating some disk usage).
In the single (combined) wordlist version, each token appears in the
wordlist once, along with two counts (for ham and spam), and (optionally)
the timestamp.
The major advantage of single wordlist bogofilter is that only one wordlist
needs to be searched for each token, rather than two. Fewer lookups per
token means a faster bogofilter.
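As a rough illustration of the two layouts, here is a minimal Python sketch that uses plain dictionaries in place of the BerkeleyDB files. The function names (merge_wordlists, lookup_separate, lookup_combined), the tuple layout, and the sample tokens and dates are assumptions made for this example, not bogofilter's actual code or on-disk format.

```python
# Illustrative only: dicts stand in for the BerkeleyDB wordlists.
# Separate layout: one mapping per list, token -> (count, timestamp).
# Combined layout: one mapping, token -> (spam_count, ham_count, timestamp).


def merge_wordlists(spam, ham):
    """Build a combined wordlist from separate spam and ham wordlists."""
    combined = {}
    for token in spam.keys() | ham.keys():
        spam_count, spam_time = spam.get(token, (0, 0))
        ham_count, ham_time = ham.get(token, (0, 0))
        # Keeping the most recent timestamp is an assumption of this sketch.
        combined[token] = (spam_count, ham_count, max(spam_time, ham_time))
    return combined


def lookup_separate(spam, ham, token):
    """Two searches per token, one in each wordlist."""
    spam_count = spam.get(token, (0, 0))[0]
    ham_count = ham.get(token, (0, 0))[0]
    return spam_count, ham_count


def lookup_combined(combined, token):
    """One search per token returns both counts."""
    spam_count, ham_count, _ = combined.get(token, (0, 0, 0))
    return spam_count, ham_count


# Made-up sample data; timestamps are written as YYYYMMDD integers.
spamlist = {"viagra": (40, 20030715), "free": (25, 20030714)}
goodlist = {"free": (3, 20030710), "patch": (12, 20030715)}
wordlist = merge_wordlists(spamlist, goodlist)
```

Here the token "free" occupies one record in the combined wordlist instead of one record in each of the two separate wordlists, and lookup_combined needs a single search to return both counts.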
The major disadvantage is that BerkeleyDB's performance is closely tied to
the size of its cache. The default cache size works well for multiple
wordlist bogofilter, but not for single wordlist bogofilter. A cache of
several megabytes provides the needed space, at the expense of some RAM.
The disk space needed for the two database versions is comparable. It
seems that relatively few tokens appear in both ham and spam, so the space
needed for the ham and spam counts pretty much uses up the space that is
saved by combining ham and spam tokens.
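The trade-off can be put into rough numbers. The sketch below counts raw record bytes only and ignores BerkeleyDB's page and index overhead; the function name wordlist_bytes and all the sizes and token counts are hypothetical figures chosen for illustration, not measurements. Under this simple model, the separate layout minus the combined layout comes to both * (key_size + time_size) - (spam_only + ham_only) * count_size, so the combined wordlist only wins on size when the overlap is large.

```python
def wordlist_bytes(spam_only, ham_only, both, key_size, count_size, time_size):
    """Return (separate_bytes, combined_bytes), counting raw record data only."""
    separate_record = key_size + count_size + time_size
    combined_record = key_size + 2 * count_size + time_size
    # A token seen in both ham and spam is stored twice in the separate layout.
    separate = (spam_only + ham_only + 2 * both) * separate_record
    # The combined layout stores every token once, with one extra count each.
    combined = (spam_only + ham_only + both) * combined_record
    return separate, combined


# Hypothetical figures: 8-byte tokens, 4-byte counts, 4-byte timestamps.
no_overlap = wordlist_bytes(30_000, 30_000, 0, 8, 4, 4)
some_overlap = wordlist_bytes(30_000, 30_000, 20_000, 8, 4, 4)
```

With these made-up sizes, the case with no shared tokens makes the combined wordlist larger, while the case with 20,000 shared tokens happens to break even.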
Anyhow, enough history - time for some status info.
My development version of bogofilter can operate in either single wordlist
or multiple wordlist mode. The value of a single global variable
determines the mode. The CVS repository will be updated with this code in
the next day or so.
The single/multiple wordlist code has some rough edges. It works, but
doesn't seem polished. Suggestions for improving it will be welcomed :-)
After the CVS update, the next step is for bogofilter to check whether
BOGOFILTER_DIR contains one wordlist or two and to operate in the
appropriate mode.
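One way such a check might look is sketched below in Python. The names spamlist.db and goodlist.db come from this message; the combined file name wordlist.db, the function detect_wordlist_mode, and its return values are assumptions made for the example, since the actual detection code has not been written yet.

```python
import os
from pathlib import Path

SEPARATE_FILES = ("spamlist.db", "goodlist.db")  # names used by bogofilter today
COMBINED_FILE = "wordlist.db"  # hypothetical name for the combined wordlist


def detect_wordlist_mode(directory):
    """Return "single", "multiple", "ambiguous", or "none" for a directory."""
    base = Path(directory)
    has_combined = (base / COMBINED_FILE).is_file()
    # Multiple-wordlist mode needs both the spam and the ham file.
    has_separate = all((base / name).is_file() for name in SEPARATE_FILES)
    if has_combined and has_separate:
        return "ambiguous"
    if has_combined:
        return "single"
    if has_separate:
        return "multiple"
    return "none"


# Only inspect BOGOFILTER_DIR when the environment variable is actually set.
bogofilter_dir = os.environ.get("BOGOFILTER_DIR")
if bogofilter_dir:
    print(detect_wordlist_mode(bogofilter_dir))
```

Reporting "ambiguous" when both layouts are present, rather than silently picking one, is a design choice of this sketch; the real code may well settle on a different policy.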
After that comes the merging of Gyepi's structural changes (triggered by his
tdb work), the inclusion of the tdb code, and the release of
bogofilter-0.14.0. If the cdb work is ready, that will also be merged in.
... and that's all I currently know about bogofilter and databases.
David