cdb preliminary results
Greg Louis
glouis at dynamicro.on.ca
Thu Jul 10 16:55:48 CEST 2003
Exec summary: this looks promising.
For experimental purposes, I chopped bogofilter into 3 chunks: a
tokenizer (roughly equivalent to bogolexer -p except that when it finds
a 'From ' at the beginning of a line it emits a blank line and keeps
going), a lookup module that takes the tokenizer output and uses the
wordlist to build what David calls "msg-count" format, and a classifier
that reads msg-count format and emits spam scores (by default, in
bogofilter terse format). The classifier, by default, expects
mbox-format input and spawns tokenizer and lookup processes to convert
to msg-count; but if msg-count files are available, there's an option
to skip that and read msg-count directly.
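For anyone wanting to picture the tokenizer stage, here is a minimal sketch of the 'From ' handling described above. It is not the actual code: the token pattern and the name tokenize_mbox are made up for illustration (the real lexer rules are much more involved), and I am assuming the separator line itself is not tokenized.

```python
import re

# Hypothetical token pattern, for illustration only; the real lexer
# rules are considerably more involved.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+")


def tokenize_mbox(lines):
    """Yield tokens one at a time from mbox-format lines.

    An empty string (printed as a blank line) is yielded whenever a line
    starts with 'From ', marking a message boundary, and tokenizing then
    continues with the next line.
    """
    for line in lines:
        if line.startswith("From "):
            # Assumption: the separator line itself is skipped.
            yield ""
            continue
        for tok in TOKEN_RE.findall(line):
            yield tok.lower()
```

Printing each yielded item on its own line gives a stream that the lookup stage can split into messages at the blank lines.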
It was therefore extremely easy to try cdb; all I had to do was rewrite
the lookup module. The first test just classified 550 spams:
        cdb         db
real    0m2.600s    0m5.279s
user    0m0.720s    0m3.100s
sys     0m0.050s    0m0.400s
I made an off-by-one (order of magnitude ;) error in my last post when I
said my 30-MB db wordlist shrank to 4M; I should have realized that
wasn't possible. (David will gloat: he always tells me I should use ls
-h or ls -K instead of trying to read sizes in bytes :) In fact:
35M Jul 10 10:25 wordlist.cdb
30M Jul 10 07:32 wordlist.db
Implementing cdb in the form of a datastore_cdb file is going to need
some bending and twisting, I think, because of the need to rebuild the
db every time you write to it. This is an argument in favour of
keeping the dbt_v structure, because when the field size is constant
you can update in place -- if only counts or timestamp need to change,
it's doable. As soon as you need a new token, though, a full rewrite
is required; I suspect that that's all the time, since my training db
has a lot of what look like message-ID tokens.
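To make the update-in-place point concrete, here is a rough sketch under stated assumptions. The record layout (three unsigned 32-bit fields for spam count, good count and date stamp), the token-to-offset index and the name bump are all hypothetical; this is not the real dbt_v definition or any cdb library API, just the general idea of overwriting a constant-size value where it sits.

```python
import struct

# Hypothetical fixed-size value: spam count, good count, date stamp,
# each an unsigned 32-bit little-endian integer (12 bytes in total).
RECORD = struct.Struct("<III")


def bump(buf, index, token, spam_delta=0, good_delta=0, date=None):
    """Try to update a token's counts without rewriting the file.

    buf   -- bytearray standing in for the data file's contents
    index -- dict mapping token -> byte offset of its value in buf
    Returns True if the value was overwritten in place, or False if the
    token is new, in which case a full rebuild would be required.
    """
    offset = index.get(token)
    if offset is None:
        return False  # new token: cannot be added in place
    spam, good, old_date = RECORD.unpack_from(buf, offset)
    new_date = old_date if date is None else date
    # The value has the same size as before, so nothing else has to move.
    RECORD.pack_into(buf, offset, spam + spam_delta, good + good_delta, new_date)
    return True
```

A caller would fall back to regenerating the whole file whenever bump returns False, which is the case I expect to hit on nearly every training run.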
--
| G r e g L o u i s | gpg public key: finger |
| http://www.bgl.nu/~glouis | glouis at consultronics.com |
| http://wecanstopspam.org in signatures fights junk email |