Plan for performance improvement

David Relson relson at osagesoftware.com
Sat Sep 14 16:49:10 CEST 2002


At 04:06 AM 9/14/02, Eric Seppanen wrote:
>On Fri, Sep 13, 2002 at 06:52:35PM -0700, Adrian Otto wrote:
>
> > Proposed Change:
> > I propose we use a doubly-linked list to hold the top 15 "significant"
> > words. We then keep the list sorted by ascending "prob" values. Then we use
> > a hash table that lets us track which words we have already evaluated to
> > prevent the need to evaluate them multiple times when they repeat in the
> > message.
>
>A variation on this theme would be to build a hash table or word list
>first, then once the whole message has been parsed, process the whole
>list.  This would also eliminate the redundant lookup of duplicate words
>in the spamlist/hamlist.

Good morning Eric,

It seems the perfect tool for a word list is already present - the Judy
library.  Using it would change the algorithm somewhat.  Currently,
bogofilter loops over get_token() to parse the message and does the
spamicity calculation for each word as it appears.  The modification would
be to use the Judy library inside that loop to build the word list, then
loop through the word list to calculate spamicity once per distinct word.

As I write this, it doesn't seem like a big task.  However, I do have other 
things to do today, so someone else can reap the glory, if so desired.

David



