cdb preliminary results

Greg Louis glouis at dynamicro.on.ca
Thu Jul 10 16:55:48 CEST 2003


Exec summary: this looks promising.

For experimental purposes, I chopped bogofilter into 3 chunks: a
tokenizer (roughly equivalent to bogolexer -p except that when it finds
a 'From ' at the beginning of a line it emits a blank line and keeps
going), a lookup module that takes the tokenizer output and uses the
wordlist to build what David calls "msg-count" format, and a classifier
that reads msg-count format and emits spam scores (by default, in
bogofilter terse format).  The classifier, by default, expects
mbox-format input and spawns tokenizer and lookup processes to convert
to msg-count; but if msg-count files are available, there's an option
to skip that and read msg-count directly.
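
For anyone who wants to picture the three chunks, here is a rough
Python sketch of the split.  Everything in it is an assumption made
for illustration: the function names, the whitespace tokenizer, the
dict standing in for the wordlist, the choice not to tokenize the
'From ' line itself, and above all the scoring rule, which is a toy
stand-in and not bogofilter's algorithm.  The real msg-count format is
not reproduced here either.

```python
def tokenize(mbox_text):
    """Yield tokens; yield '' (a blank line) whenever a line starts with 'From '."""
    for line in mbox_text.splitlines():
        if line.startswith("From "):
            yield ""              # message boundary, then keep going
            continue              # assumption: the separator line is not tokenized
        for tok in line.lower().split():   # crude whitespace tokenizer
            yield tok


def lookup(tokens, wordlist):
    """Group tokens per message and attach (spam, good) counts from the wordlist."""
    messages, current = [], None
    for tok in tokens:
        if tok == "":
            current = {}          # a blank line starts a new message
            messages.append(current)
        elif current is not None:
            current[tok] = wordlist.get(tok, (0, 0))
    return messages


def classify(messages):
    """Toy score per message: spam counts as a share of all counts seen."""
    scores = []
    for counts in messages:
        spam = sum(s for s, _ in counts.values())
        good = sum(g for _, g in counts.values())
        total = spam + good
        scores.append(spam / total if total else 0.5)
    return scores


# Hypothetical two-message mbox and a tiny stand-in wordlist.
mbox = ("From a@b Thu Jul 10\nbuy cheap pills\n"
        "From c@d Thu Jul 10\nmeeting notes\n")
wordlist = {"cheap": (9, 1), "pills": (10, 0), "meeting": (0, 5)}
messages = lookup(tokenize(mbox), wordlist)
scores = classify(messages)
```

The point of the split is that only lookup() touches the wordlist, so
swapping the backing store means replacing that one stage.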

It was therefore extremely easy to try cdb; all I had to do was rewrite
the lookup module.  The first test just classified 550 spams:

             cdb           db
real    0m2.600s     0m5.279s
user    0m0.720s     0m3.100s
sys     0m0.050s     0m0.400s
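
To put those numbers side by side, a few lines of Python give the
db/cdb ratio for each row (the timings are copied from the table
above; the variable names are just for illustration):

```python
# (cdb seconds, db seconds) for each row of the timing table
timings = {
    "real": (2.600, 5.279),
    "user": (0.720, 3.100),
    "sys": (0.050, 0.400),
}

# How many times longer the db run took than the cdb run
ratios = {name: db / cdb for name, (cdb, db) in timings.items()}

for name, ratio in ratios.items():
    print(f"{name}: db/cdb = {ratio:.1f}x")
```

That works out to roughly 2x on wall-clock time and about 4x on user
time for this one run of 550 spams.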

I made an off-by-one (order of magnitude ;) error in my last post, when
I said my 30 MB db wordlist shrank to 4 MB; I should have realized that
that wasn't possible.  (David will gloat: he always tells me I should
use ls -h or ls -K instead of trying to read sizes in bytes :)  In
fact, the cdb file is slightly larger than the db file:
          35M Jul 10 10:25 wordlist.cdb
          30M Jul 10 07:32 wordlist.db

Implementing cdb in the form of a datastore_cdb file is going to need
some bending and twisting, I think, because a cdb has to be rebuilt
every time you write to it.  This is an argument in favour of keeping
the dbt_v structure: when the field size is constant you can update in
place, so if only the counts or the timestamp need to change, it's
doable.  As soon as you need a new token, though, a full rewrite is
required; I suspect that happens all the time, since my training db
has a lot of what look like message-ID tokens.
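
The in-place idea can be sketched in Python as follows.  This is only
an illustration under stated assumptions: the record layout (three
32-bit unsigned ints for spam count, good count and timestamp) is a
guess at what a fixed-size value might look like, the helper names are
made up, and a real cdb file also carries keys and hash tables that
are ignored here.  The offset of the value is assumed to have been
found by a normal lookup.

```python
import io
import struct

# Hypothetical fixed-size value: spam count, good count, timestamp
RECORD = struct.Struct("<III")


def update_in_place(f, offset, spam, good, stamp):
    """Overwrite one fixed-size value; the file length does not change."""
    f.seek(offset)
    f.write(RECORD.pack(spam, good, stamp))


def read_record(f, offset):
    """Read back the fixed-size value stored at the given byte offset."""
    f.seek(offset)
    return RECORD.unpack(f.read(RECORD.size))


# Demo on an in-memory "file" holding two values back to back.
buf = io.BytesIO(RECORD.pack(3, 1, 20030709) + RECORD.pack(0, 7, 20030709))
size_before = len(buf.getvalue())

# Touch only the second value: bump its spam count and timestamp.
update_in_place(buf, RECORD.size, 1, 7, 20030710)
size_after = len(buf.getvalue())
```

Because the new value is exactly as long as the old one, nothing after
it has to move.  A new token has no slot to overwrite, which is why
that case forces the full rebuild.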

-- 
| G r e g  L o u i s          | gpg public key: finger     |
|   http://www.bgl.nu/~glouis |   glouis at consultronics.com |
| http://wecanstopspam.org in signatures fights junk email |



