limiting token length [was: Performance issues....and ugly news.]

David Relson relson at osagesoftware.com
Sat Feb 22 22:17:43 CET 2003


At 04:01 PM 2/22/03, Matthias Andree wrote:

> > Idea 2 - the current lexer works adequately for the vast majority of
> > messages.  What it doesn't handle satisfactorily is humongously long
> > strings of characters which match a rule, hence might be a valid
> > token.
>
>I wonder if adding {1,30} or something helps. It chops long strings into
>many small strings though. Tracking the state (i.e. don't return TOKEN
>if the previous TOKEN wasn't separated by a delimiter) might fix this.

Matthias,

See my test results (from a few minutes back).  I don't think it's
necessary to chop the strings and track state.  Bogofilter's algorithm
already discards overly long tokens (OLTs).  If the lexer uses a large
maximum length (say 40 or 50), it generates tokens that bogofilter will
then discard.  That test showed a 15% savings.
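To make that concrete, here's roughly what the discard step amounts to (a
sketch only; get_token(), next_useful_token(), and the cutoff value are
illustrative names, not bogofilter's actual identifiers):

    /* Sketch: pull tokens from the lexer and drop any that are too
     * long to be useful.  get_token() stands in for the yylex()
     * wrapper; MAX_TOKEN_LEN is an assumed cutoff. */
    #include <string.h>

    #define MAX_TOKEN_LEN 30

    extern int get_token(char *buf, size_t bufsize);  /* wraps yylex() */

    int next_useful_token(char *buf, size_t bufsize)
    {
        int type;

        while ((type = get_token(buf, bufsize)) != 0) {
            if (strlen(buf) <= MAX_TOKEN_LEN)
                return type;   /* short enough: count it */
            /* overly long token (OLT): silently discard and keep going */
        }
        return 0;              /* end of message */
    }

With a check like this already in place, a large {1,40} or {1,50} bound in
the lexer just produces tokens that die here anyway.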

My second test, in which OLTs are detected and discarded before the lexer
ever sees them, was much faster: approximately a 99% savings for the two
pathological test cases.
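
For reference, that second approach boils down to something like the
following (again just a sketch; is_token_char(), strip_long_runs(), and
the cutoff are names I'm making up here, not the code I actually timed):

    /* Sketch: filter the raw text before the lexer sees it, dropping
     * any run of token characters longer than the cutoff. */
    #include <ctype.h>
    #include <string.h>

    #define MAX_TOKEN_LEN 30

    static int is_token_char(int c)
    {
        /* crude approximation of the characters a token may contain */
        return isalnum(c) || c == '.' || c == '-' || c == '_';
    }

    /* Copy src to dst, omitting runs of token characters longer than
     * MAX_TOKEN_LEN.  dst must be at least as large as src. */
    size_t strip_long_runs(const char *src, char *dst)
    {
        size_t out = 0;

        while (*src != '\0') {
            if (!is_token_char((unsigned char)*src)) {
                dst[out++] = *src++;    /* delimiters pass through */
                continue;
            }
            size_t run = 0;
            while (is_token_char((unsigned char)src[run]))
                run++;
            if (run <= MAX_TOKEN_LEN) {
                memcpy(dst + out, src, run);
                out += run;
            }
            /* longer runs are dropped: the lexer never sees them */
            src += run;
        }
        dst[out] = '\0';
        return out;
    }

With the long runs gone before yylex() runs, the lexer does almost no work
on the pathological messages, which is where the savings comes from.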

David




