Plan for performance improvement
Eric Seppanen
eds at reric.net
Sat Sep 14 10:06:31 CEST 2002
On Fri, Sep 13, 2002 at 06:52:35PM -0700, Adrian Otto wrote:
> Proposed Change:
> I propose we use a double-linked list to hold the top 15 "significant"
> words. We then keep the list sorted by ascending "prob" values. Then we use
> a hash table that lets us track which words we have already evaluated to
> prevent the need to evaluate them multiple times when they repeat in the
> message.
A variation on this theme would be to build a hash table or word list
first, then once the whole message has been parsed, process the whole
list. This would also eliminate the redundant lookup of duplicate words
in the spamlist/hamlist.
If you'll permit me to play devil's advocate, I have a few concerns:
- Sounds a bit complicated. Maybe too early in bogofilter's life for
optimization?
- Are string compares really that CPU-intensive? Remember that roughly
96% of them will fail on the first character, and 99.9% will have
failed by the second character...
- Building and checking a hash table on the fly may eat up any
performance gain, and then some.
- Big messages might eat up a nontrivial amount of memory. Right now
bogofilter uses no extra memory for large messages.
Maybe profiling bogofilter in action would tell us more about where the
code spends its time. That would show us which bits are worth targeting
for optimization, and let us keep the rest of the code as simple and
readable as possible.