Plan for performance improvement

Adrian Otto aotto at aotto.com
Sat Sep 14 18:34:45 CEST 2002


Eric,

Thanks very much for your response. That's exactly the type of feedback I
was looking for.

> A variation on this theme would be to build a hash table or word list
> first, then once the whole message has been parsed, process the whole
> list.  This would also eliminate the redundant lookup of duplicate words
> in the spamlist/hamlist.

Good suggestion. If building the hash/btree in advance yields a significant
CPU saving, it's probably worth the memory-usage tradeoff. Alternatively, we
could skip the hash/btree entirely and use only the doubly-linked list. More
comments on this follow.
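To make the idea concrete, here is a minimal sketch in C of counting tokens in a small chained hash table while the message is parsed, so the spamlist/hamlist lookup later happens once per distinct word rather than once per occurrence. This is hypothetical illustration code, not bogofilter's actual source; the names (note_word, wordnode) and the table size are my invention:

```c
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 1021

/* One node per distinct token seen in the current message. */
typedef struct wordnode {
    struct wordnode *next;   /* chain within a bucket */
    unsigned count;          /* occurrences in this message */
    char word[1];            /* token text, allocated past the struct */
} wordnode;

static wordnode *buckets[NBUCKETS];

static unsigned hash(const char *s)
{
    unsigned h = 5381;
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h % NBUCKETS;
}

/* Record one occurrence of 'word'; returns its node. */
wordnode *note_word(const char *word)
{
    unsigned h = hash(word);
    wordnode *p;

    for (p = buckets[h]; p != NULL; p = p->next)
        if (strcmp(p->word, word) == 0) {
            p->count++;          /* duplicate: just bump the count */
            return p;
        }

    p = malloc(sizeof *p + strlen(word));
    p->next = buckets[h];
    p->count = 1;
    strcpy(p->word, word);
    buckets[h] = p;
    return p;
}
```

After parsing, a single walk over the buckets yields each distinct word and its count, so each word costs exactly one spamlist lookup and one hamlist lookup regardless of how often it appears.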

> If you'll permit me to play devil's advocate, I have a few concerns:
>
> - Sounds a bit complicated.  Maybe too early in bogofilter's life for
> optimization?

I agree that simplicity has its advantages. I'm proposing this now because
we started patching this area of the code. I thought I'd suggest this idea
now in case we do want to make this sort of change. We should certainly keep
in mind that sometimes the best option is to leave things alone.

> - Are string compares really that cpu intensive?  Remember that roughly
> 96% of them will fail on the first character, 99.9 percent will have
> failed after the second character...

Well, anything done inside these sorts of loops piques my interest. I just
noticed the strcmp() as one potential area for improvement.

> - Building and checking a hash table on the fly may eat up any
> performance gain, and then some.

If we determine that maintaining the hash/btree is too expensive (in memory
or CPU), we can simply drop that part of the plan and use the rest of it. We
could still use the doubly-linked-list idea and reduce the number of data
comparisons required to determine whether the current word has a higher
'prob' value than the ones already in the list. We can do this by examining
the 'prob' value of the item our list pointer refers to (the head of the
list). If the prob value is not high enough to make it into the array, we
break out of the loop and never execute a single strcmp() call.

When we need to insert something into the list it's a matter of changing
pointer values around to put it in the right sorted sequence.
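As a sketch of that pointer juggling (again with invented names and a made-up KEEP size, not actual bogofilter code): the list is kept sorted ascending by prob, so once it is full, a candidate that cannot beat the minimum at the head is rejected with a single floating-point compare and zero strcmp() calls.

```c
#include <string.h>

#define KEEP 15   /* how many extreme words to retain (illustrative size) */

typedef struct entry {
    struct entry *prev, *next;
    const char *word;
    double prob;
} entry;

static entry pool[KEEP];
static entry *head;       /* lowest prob kept; list ascends from here */
static int used;

/* Consider 'word' for the top-KEEP list.  When the list is full and the
 * candidate cannot beat the current minimum, we return immediately:
 * one double compare, no strcmp() at all. */
void consider(const char *word, double prob)
{
    entry *e, *p;

    if (used == KEEP && prob <= head->prob)
        return;                      /* cheap rejection path */

    /* Duplicate check only for words that might actually qualify. */
    for (p = head; p != NULL; p = p->next)
        if (strcmp(p->word, word) == 0)
            return;

    if (used < KEEP) {
        e = &pool[used++];
    } else {
        e = head;                    /* recycle the current minimum */
        head = head->next;
        head->prev = NULL;
    }
    e->word = word;
    e->prob = prob;

    /* Insertion is just relinking pointers at the right sorted spot. */
    if (head == NULL || prob <= head->prob) {
        e->prev = NULL;
        e->next = head;
        if (head) head->prev = e;
        head = e;
    } else {
        for (p = head; p->next != NULL && p->next->prob < prob; p = p->next)
            ;
        e->next = p->next;
        e->prev = p;
        if (p->next) p->next->prev = e;
        p->next = e;
    }
}
```

Note that the strcmp() loop still runs for words that might qualify; the savings come from the common case where, once the list fills up, most words fail the prob test outright.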

> - Big messages might eat up a nontrivial amount of memory.  Right now
> bogofilter uses no extra memory for large messages.

Good point. With my modified approach above, we get the same advantage.

> Maybe profiling bogofilter in action would tell us more about where the
> code spends its time.  We should know which bits should be potential
> targets for optimization, and try to keep the rest of the code as simple
> and readable as possible.

I've done this in the past with Rational's Quantify. I don't have access to
this application any more. What profilers are available in the open source
world? What have you used to do this sort of analysis?

Thanks,

Adrian
