Plan for performance improvement
Eric Seppanen
eds at reric.net
Sat Sep 14 10:06:31 CEST 2002
On Fri, Sep 13, 2002 at 06:52:35PM -0700, Adrian Otto wrote:
> Proposed Change:
> I propose we use a double-linked list to hold the top 15 "significant"
> words. We then keep the list sorted by ascending "prob" values. Then we use
> a hash table that lets us track which words we have already evaluated to
> prevent the need to evaluate them multiple times when they repeat in the
> message.
A variation on this theme would be to build a hash table or word list
first, then once the whole message has been parsed, process the whole
list. This would also eliminate the redundant lookup of duplicate words
in the spamlist/hamlist.
If you'll permit me to play devil's advocate, I have a few concerns:
- Sounds a bit complicated. Maybe too early in bogofilter's life for
optimization?
- Are string compares really that CPU-intensive? Remember that roughly
96% of them will fail on the first character, and 99.9% will have
failed by the second character...
- Building and checking a hash table on the fly may eat up any
performance gain, and then some.
- Big messages might eat up a nontrivial amount of memory. Right now
bogofilter uses no extra memory for large messages.
Maybe profiling bogofilter in action would tell us more about where the
code spends its time. That would show us which bits are worth targeting
for optimization, and let us keep the rest of the code as simple and
readable as possible.