some performance numbers

Tue Feb 25 19:20:46 CET 2003

Greetings,

Those of you who have been following the mailing lists are aware that work 
is being done on bogofilter's token analysis.  Greg Louis ran into 
performance problems with a few large emails.  Below are descriptions of 
those emails and timings for several variants of bogofilter's lexer.

File 2.txt has a 5MB .doc file as an attachment.
File 3.txt is (basically) 100,000 letter x's as a quoted-printable MIME 
document.
File 4.txt is (basically) file 3.txt, but with 600,000 x's instead of 100,000.

Lexer variants:

1 - cvs - current - 02/25 12:00
2 - njs - lexer buffer swapping
3 - dmr - unified lexer
4 - dmr - unified lexer with special handling for long tokens

Here are some performance numbers for the 3 files and the 4 lexer variants:

           (1)     (2)     (3)   (4)
2.txt    11.26   11.26    9.32  9.30
3.txt     5.74    5.72    4.67  0.04
4.txt   153.47  153.45  124.25  0.21

Lexer 2's numbers don't presently show an advantage, but they are of 
special interest because Nick Simicich is working to use flex's "batch" 
mode (instead of the current interactive mode) and to revise the lexer 
rules to avoid "backups" (which are slow).  Successful use of batch mode 
and removal of backups promises significant speed improvements.

Lexer 3 is roughly 17% to 19% faster than lexer 1 on all three files 
(9.32 vs. 11.26, 4.67 vs. 5.74, 124.25 vs. 153.47).  The reason is unknown, 
i.e. I've not yet been motivated to track it down.
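
For anyone who wants to recompute the relative saving from the table, here 
is a small sketch.  The helper name percent_saved is made up for this note 
and is not part of bogofilter.

```c
/* Percentage of run time saved when a measurement drops from
 * `before` to `after`.  Hypothetical helper, for illustration only. */
double percent_saved(double before, double after)
{
    return 100.0 * (before - after) / before;
}
```

Applied to columns (1) and (3), this gives about 17.2 for 2.txt, 18.6 for 
3.txt and 19.0 for 4.txt.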

Lexer 4 has special C code to check for alphanumeric sequences that exceed 
the maximum allowed token length and to discard them before the lexer has 
to process them.
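
To illustrate the idea, here is a minimal sketch of such a pre-filter.  It 
is not the code in lexer 4: the function name drop_long_runs, the limit of 
30, and the assumption that the text sits in a single in-memory buffer are 
all made up for this example.  In particular it ignores runs that straddle 
two buffer refills, which real code would have to handle.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical limit; the real value is whatever maximum token
 * length the lexer is built with. */
#define MAX_TOKEN_LEN 30

/* Remove, in place, every run of alphanumeric characters that is
 * longer than max_len.  Such a run can never become a token, so the
 * lexer does not need to see it.  Returns the new content length. */
size_t drop_long_runs(char *buf, size_t len, size_t max_len)
{
    size_t in = 0;   /* next byte to read  */
    size_t out = 0;  /* next byte to write */

    while (in < len) {
        if (isalnum((unsigned char)buf[in])) {
            /* Measure the whole alphanumeric run first. */
            size_t start = in;
            while (in < len && isalnum((unsigned char)buf[in]))
                in++;
            size_t run = in - start;

            if (run <= max_len) {
                /* Short enough to be a token: keep it. */
                memmove(buf + out, buf + start, run);
                out += run;
            }
            /* Otherwise the run is skipped entirely. */
        } else {
            /* Non-alphanumeric bytes are always passed through. */
            buf[out++] = buf[in++];
        }
    }
    return out;
}
```

With a limit of 5, the buffer "hi xxxxxxxxxx there" is reduced to 
"hi  there" (length 9); the ten x's never reach the lexer.  A caller would 
normally pass MAX_TOKEN_LEN as the limit.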

Lexers 3 and 4 can probably also take advantage of batch mode.  That 
experiment is pending.

And that's all the news that's fit to print :-)

David