limiting token length [was: Performance issues....and ugly news.]
David Relson
relson at osagesoftware.com
Sat Feb 22 22:17:43 CET 2003
At 04:01 PM 2/22/03, Matthias Andree wrote:
> > Idea 2 - the current lexer works adequately for the vast majority of
> > messages. What it doesn't handle satisfactorily is humongously long
> > strings of characters which match a rule, hence might be a valid
> > token.
>
>I wonder if adding {1,30} or something helps. It chops long strings into
>many small strings though. Tracking the state (i.e., don't return TOKEN
>if the previous TOKEN wasn't separated by a delimiter) might fix this.
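For reference, the chopping-plus-state idea quoted above might look roughly
like the flex fragment below. The character class, the 30-character bound,
and the TOKEN name are illustrative only, not bogofilter's actual lexer:

    %{
    /* TOKEN assumed defined elsewhere (e.g. by the parser header) */
    static int prev_was_token = 0;   /* no delimiter seen since the last TOKEN */
    %}
    %%
    [A-Za-z0-9'._-]{1,30}   {
                                if (!prev_was_token) {
                                    prev_was_token = 1;
                                    return TOKEN;    /* first chunk of a run */
                                }
                                /* continuation of a chopped run: skip it */
                            }
    [^A-Za-z0-9'._-]+       { prev_was_token = 0;    /* delimiter resets the state */ }
    %%

Note that with this sketch the first chunk of a long run is still returned;
only the follow-on chunks are suppressed.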
Matthias,
See my test results (from a few minutes back). I don't think it's
necessary to chop and keep track of the state. Bogofilter's algorithm
already discards overly long tokens (OLTs). If the lexer uses a large max
length (say 40 or 50), it generates tokens that bogofilter will then
discard. That test showed a 15% savings.
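Roughly, that first test amounts to the sketch below: a capped lexer rule
plus the existing length check in the caller. The pattern, MAXTOKENLEN, and
register_token are illustrative assumptions, not the actual bogofilter
sources:

    /* lexer rule capped at a generous 50 characters */
    [A-Za-z0-9'._-]{1,50}    { return TOKEN; }

    /* caller: the existing length check throws the long pieces away */
    while (yylex() != 0) {
        if (yyleng > MAXTOKENLEN)         /* overly long token: discard it */
            continue;
        register_token(yytext, yyleng);   /* hypothetical bookkeeping call */
    }

The lexer still matches and returns every 40-50 character piece of a
pathological run, which is presumably why the savings stop around 15%.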
My second test, where OLTs are detected and discarded before the lexer ever
sees them, was much faster - approximately a 99% savings for the two
pathological test cases.
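The second test corresponds roughly to a pre-pass over the input buffer, so
the pathological runs never reach flex at all. This is a minimal sketch;
is_token_char, MAXTOKENLEN, and drop_long_runs are hypothetical names, and
the 30-character limit is only an example:

    #include <ctype.h>
    #include <string.h>     /* memmove */

    #define MAXTOKENLEN 30  /* illustrative limit */

    /* hypothetical helper: which characters count as part of a token */
    static int is_token_char(int c)
    {
        return isalnum((unsigned char)c) ||
               c == '\'' || c == '.' || c == '_' || c == '-';
    }

    /* drop any run of token characters longer than MAXTOKENLEN,
     * compacting the buffer in place; returns the new length */
    static size_t drop_long_runs(char *buf, size_t len)
    {
        size_t in = 0, out = 0;
        while (in < len) {
            if (!is_token_char(buf[in])) {       /* delimiters pass through */
                buf[out++] = buf[in++];
                continue;
            }
            size_t run = in;                     /* measure the run of token chars */
            while (run < len && is_token_char(buf[run]))
                run++;
            if (run - in <= MAXTOKENLEN) {       /* normal word: keep it */
                memmove(buf + out, buf + in, run - in);
                out += run - in;
            }                                    /* else overly long run: dropped */
            in = run;
        }
        return out;
    }

Since the lexer never has to match the huge strings at all, nearly all of
the matching cost disappears, which fits the ~99% figure for the
pathological cases.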
David