performance - the simple solution
David Relson
relson at osagesoftware.com
Sat Feb 22 22:06:39 CET 2003
Nick,
Below are some timing tests I ran.
Test #1 uses the standard token pattern. This provides the base for speed
comparisons.
Test #2 has the lexer limit token size to 100 characters. The results are
about 15% faster than the base times.
Test #3 does a simple check for a long series of letters and numbers at
the beginning of each line. If the run exceeds MAXTOKENLEN characters, the
characters are discarded. For the two messages with thousands of x's it's
much faster - better than 99% faster. For the 5MB message, its time is
the same as #1's (appropriate, since it uses the standard token pattern). A
patch adding the check_alphanum() function to lexer.c is attached.
While this special check for overly long tokens (OLTs) is not as elegant as
having flex do the job, it produces the desired result and is significantly
faster. The attached patch includes the proof-of-concept function, which
only deals with OLTs at the beginning of the line. For actual use it
needs to be enhanced, for example to find and discard OLTs anywhere in the
line.
David
*** 1 ***
TOKEN
[^[:blank:][:cntrl:][:digit:][:punct:]][^[:blank:]<>;=():&%$#@!+|/\\{}\[\]^\"\?\*,[:cntrl:]]+[^[:blank:][:punct:][:cntrl:]]
2.txt 10.89 user
3.txt 5.55 user
4.txt 148.08 user
*** 2 ***
TOKEN
[^[:blank:][:cntrl:][:digit:][:punct:]][^[:blank:]<>;=():&%$#@!+|/\\{}\[\]^\"\?\*,[:cntrl:]]{2,50}[^[:blank:][:punct:][:cntrl:]]
2.txt 9.43 user
3.txt 4.67 user
4.txt 126.16 user
*** 3 ***
using check_alphanum()
2.txt 10.86 user
3.txt 0.04 user
4.txt 0.22 user
-------------- next part --------------
Index: lexer.c
===================================================================
RCS file: /cvsroot/bogofilter/bogofilter/src/lexer.c,v
retrieving revision 1.5
diff -u -r1.5 lexer.c
--- lexer.c 20 Feb 2003 13:17:47 -0000 1.5
+++ lexer.c 22 Feb 2003 20:41:43 -0000
@@ -53,6 +54,27 @@
fputc('\n', dbgout);
}
+void check_alphanum(buff_t *buff)
+{
+ size_t e = 0;
+ size_t i;
+ size_t l = buff->t.leng;
+ byte *txt = buff->t.text;
+
+ if (l < MAXTOKENLEN)
+ return;
+ for (i = 0; i < buff->t.leng; i += 1) {
+ if (isalnum((char) txt[i]))
+ e = i;
+ else
+ break;
+ }
+ if (e > MAXTOKENLEN) {
+ memcpy(txt, txt+e, l-e);
+ buff->t.leng -= e;
+ }
+}
+
bool is_from(word_t *w)
{
return (w->leng >= 5 && memcmp(w->text, "From ", 5) == 0);
@@ -200,7 +222,8 @@
/*change buffer size only if the decoding worked */
if (decoded_count != 0 && decoded_count < count) {
buff->t.leng = count = decoded_count;
- memcpy(buf, buff->t.text, count);
+ check_alphanum(buff);
+ memcpy(buf, buff->t.text, buff->t.leng);
if (DEBUG_LEXER(1))
lexer_display_buffer(buff);
}