Plan for performance improvement

Sat Sep 14 03:52:35 CEST 2002

Hey,

I think I have discovered a way to significantly speed up the message
processing code. Bogofilter currently uses an array of 15 structures
(stats.extrema) to hold the most "significant" words of the message. This
design is beautiful because of it's simplicity. Fortunately, it also leaves
us some room to improve the performance by using a slightly different data
structure design.

Current Code:
With the existing implementation, an average of 7.5 string comparisons and
multiple data comparisons are required for each word found in the message in
order to locate the word in the stats.extrema array. If words are repeated
in the message, but don't make it into the stats.extrema array, it may cause
the same repeated word to be iterated through the array many times. I'm
interested in finding a way to reduce the number of string comparisons
required for this process, which will speed up bogofilter at the very core
of it's operation.

Proposed Change:
I propose we use a double-linked list to hold the top 15 "significant"
words. We then keep the list sorted by ascending "prob" values. Then we use
a hash table that  lets us track which words we have already evaluated to
prevent the need to evaluate them multiple times when they repeat in the
message.

Potential Benefits:
1) Reduction of strcmp() operations from 7.5 to a single call to the btree
function and a single data comparison to determine if a word is already
present.
2) Repeated words identified by the call to the btree function don't need to
be evaluated against any values in the "top 15" list.
3) Although only a minor benefit, the list will already be sorted, so there
will be no need to call qsort() to format the output for "-v" mode.
4) We can populate the btree with words from a user configurable "ignore"
list, which will both speed up performance, and allow greater message
evaluation accuracy.

Notes:
 - Perhaps we can use an in-memory BerkeleyDB btree for this since we
already have the BerkeleyDB libraries loaded into the program.

Does anybody want to work on this? I'm willing to develop a patch for this,
and benchmark it against the current bogofilter implementation. I'm also
willing to work together with anyone else interested. I'm hoping that Eric
Raymond will comment on this. I think we would all really value his input.

Comments?

Thanks,

Adrian