performance - the simple solution
David Relson
relson at osagesoftware.com
Sat Feb 22 22:06:39 CET 2003
Nick,
Below are some timing tests I ran.
Test #1 uses the standard token pattern. This provides the base for speed
comparisons.
Test #2 has the lexer limit token size to 100 characters. The results are
about 15% faster than the base times.
Test #3 does a simple check for a long series of letters and numbers at
the beginning of each line. If the run exceeds MAXTOKENLEN characters, the
characters are discarded. For the two messages with thousands of x's it's
much faster - better than 99% faster. For the 5MB message, its time is
the same as #1's (appropriate, since it uses the standard token pattern). A
patch adding the check_alphanum() function to lexer.c is attached.
While this special check for overly long tokens (OLTs) is not as elegant as
having flex do the job, it produces the desired result and is significantly
faster. The attached patch includes the proof-of-concept function, which
only deals with OLTs at the beginning of the line. For actual use it
needs to be enhanced, for example to find and discard OLTs anywhere in the
line.
David
*** 1 ***
TOKEN
[^[:blank:][:cntrl:][:digit:][:punct:]][^[:blank:]<>;=():&%$#@!+|/\\{}\[\]^\"\?\*,[:cntrl:]]+[^[:blank:][:punct:][:cntrl:]]
2.txt 10.89 user
3.txt 5.55 user
4.txt 148.08 user
*** 2 ***
TOKEN
[^[:blank:][:cntrl:][:digit:][:punct:]][^[:blank:]<>;=():&%$#@!+|/\\{}\[\]^\"\?\*,[:cntrl:]]{2,50}[^[:blank:][:punct:][:cntrl:]]
2.txt 9.43 user
3.txt 4.67 user
4.txt 126.16 user
*** 3 ***
using check_alphanum()
2.txt 10.86 user
3.txt 0.04 user
4.txt 0.22 user
-------------- next part --------------
Index: lexer.c
===================================================================
RCS file: /cvsroot/bogofilter/bogofilter/src/lexer.c,v
retrieving revision 1.5
diff -u -r1.5 lexer.c
--- lexer.c 20 Feb 2003 13:17:47 -0000 1.5
+++ lexer.c 22 Feb 2003 20:41:43 -0000
@@ -53,6 +54,27 @@
fputc('\n', dbgout);
}
+void check_alphanum(buff_t *buff)
+{
+ size_t e = 0;
+ size_t i;
+ size_t l = buff->t.leng;
+ byte *txt = buff->t.text;
+
+ if (l < MAXTOKENLEN)
+ return;
+ for (i = 0; i < buff->t.leng; i += 1) {
+ if (isalnum((char) txt[i]))
+ e = i;
+ else
+ break;
+ }
+ if (e > MAXTOKENLEN) {
+ memcpy(txt, txt+e, l-e);
+ buff->t.leng -= e;
+ }
+}
+
bool is_from(word_t *w)
{
return (w->leng >= 5 && memcmp(w->text, "From ", 5) == 0);
@@ -200,7 +222,8 @@
/*change buffer size only if the decoding worked */
if (decoded_count != 0 && decoded_count < count) {
buff->t.leng = count = decoded_count;
- memcpy(buf, buff->t.text, count);
+ check_alphanum(buff);
+ memcpy(buf, buff->t.text, buff->t.leng);
if (DEBUG_LEXER(1))
lexer_display_buffer(buff);
}