some performance numbers
David Relson
relson at osagesoftware.com
Tue Feb 25 19:20:46 CET 2003
Greetings,
Those of you who have been following the mailing lists are aware that work
is being done on bogofilter's token analysis. Greg Louis ran into
performance problems with a few large emails. Below is a description of
those emails, along with performance numbers from several variants of
bogofilter's lexer.
File 2.txt has a 5MB .doc file as an attachment.
File 3.txt is (basically) 100,000 letter x's as a quoted-printable MIME
document.
File 4.txt is (basically) file 3.txt, but with 600,000 x's instead of 100,000.
Lexer variants:
1 - cvs - current - 02/25 12:00
2 - njs - lexer buffer swapping
3 - dmr - unified lexer
4 - dmr - unified lexer with special handling for long tokens
Here are some performance numbers for the 3 files and the 4 lexer variants:
          (1)      (2)      (3)     (4)
2.txt   11.26    11.26     9.32    9.30
3.txt    5.74     5.72     4.67    0.04
4.txt  153.47   153.45   124.25    0.21
While lexer 2's numbers don't presently show an advantage, they are of
special interest because Nick Simicich is working to use flex's "batch"
mode (instead of the current interactive mode) and to revise the lexer
rules to avoid "backups" (which are slow). Successful use of batch mode
and removal of backups promises significant speed improvements.
Lexer 3 is roughly 17-19% faster than lexer 1 on these three files. The
reason is unknown, i.e. I've not been motivated to track it down.
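For anyone who wants to check a percentage like this against the table, here is a minimal C sketch. The helper names percent_faster and print_lexer3_gain are made up for illustration, and the sketch assumes the table entries are elapsed times (lower is better), which the table itself does not state.

```c
#include <assert.h>
#include <stdio.h>

/* Percent reduction in elapsed time going from `before` to `after`.
 * Hypothetical helper; assumes both values are times (lower is better). */
double percent_faster(double before, double after)
{
    assert(before > 0.0);
    return 100.0 * (before - after) / before;
}

/* Print the lexer 1 -> lexer 3 improvement for the three test files,
 * using the numbers from columns (1) and (3) of the table. */
void print_lexer3_gain(void)
{
    const char *file[]  = { "2.txt", "3.txt", "4.txt" };
    const double lex1[] = { 11.26, 5.74, 153.47 };
    const double lex3[] = { 9.32, 4.67, 124.25 };

    for (int i = 0; i < 3; i++)
        printf("%s: %.1f%%\n", file[i], percent_faster(lex1[i], lex3[i]));
}
```

Calling print_lexer3_gain() from a main() prints one line per file; all three results come out a little under 20%.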
Lexer 4 has special C code to check for alphanumeric sequences that exceed
the maximum allowed token length and to discard them before the lexer has
to process them.
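To illustrate the idea (this is not the actual bogofilter code), here is a minimal C sketch of such a pre-filter. The function name drop_long_runs and the max_len parameter are hypothetical, and the sketch assumes the whole input is in a single buffer; a run that straddles two buffer reads would need extra bookkeeping.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy src[0..len) to dst, dropping every run of alphanumeric characters
 * that is longer than max_len. Shorter runs and all other characters are
 * copied unchanged. Returns the number of bytes written to dst.
 * dst must have room for at least len bytes.
 * Hypothetical helper for illustration only.
 */
size_t drop_long_runs(const char *src, size_t len, char *dst, size_t max_len)
{
    size_t in = 0, out = 0;

    assert(src != NULL && dst != NULL);

    while (in < len) {
        if (isalnum((unsigned char)src[in])) {
            /* Find the end of this alphanumeric run. */
            size_t start = in;
            while (in < len && isalnum((unsigned char)src[in]))
                in++;

            size_t run = in - start;
            if (run <= max_len) {
                /* Short enough to be a token: keep it for the lexer. */
                memcpy(dst + out, src + start, run);
                out += run;
            }
            /* Otherwise the oversized run is discarded. */
        } else {
            /* Separators and punctuation pass through untouched. */
            dst[out++] = src[in++];
        }
    }
    return out;
}
```

With max_len set to 5, the 16-byte input "hi xxxxxxxxxx yo" is reduced to the 6 bytes "hi  yo", so the lexer never sees the run of x's. Each input byte is examined once, so the cost of the filter grows linearly with the input size.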
Lexers 3 & 4 can likely also take advantage of batch mode. That
experiment is pending.
And that's all the news that's fit to print :-)
David