length vs speed

David Relson relson at osagesoftware.com
Sat Feb 22 05:57:46 CET 2003


Greetings,

I've been thinking about the problem Greg has encountered - that bogofilter 
takes a long time to process a message containing very long tokens.  His short 
test message had 100,000 x's in a row (divided into 76-character 
quoted-printable lines) and took approximately 6 seconds to process.  The long 
message had 600,000 x's in the same form and took 150 seconds.  The time 
appears to be spent deep inside flex's pattern-matching code.

We also know that real words don't contain thousands of consecutive letters, 
and that bogofilter ultimately tosses any token longer than MAXTOKENLEN 
(currently 30 characters).

Since we know that bogofilter is going to pitch really long tokens anyway, why 
not add a special check in bogofilter's input routine to pitch excessively 
long alphanumeric runs before the lexer ever sees them?  The check would go 
after saving text for the passthrough option and after any needed decoding 
(base64, quoted-printable, etc.) ...
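To make the idea concrete, here's a minimal sketch of what such a pre-filter 
could look like.  The function name drop_long_runs is hypothetical (not part 
of bogofilter's actual source); it just copies the decoded input, discarding 
any alphanumeric run longer than MAXTOKENLEN so that flex never has to match 
against a 600,000-character "token":

```c
#include <ctype.h>
#include <string.h>

#define MAXTOKENLEN 30          /* bogofilter's current token limit */

/* Hypothetical pre-filter sketch: copy `in` to `out`, dropping any run
 * of alphanumeric characters longer than MAXTOKENLEN.  This would run
 * after passthrough text is saved and after base64/qp decoding, so the
 * flex scanner only ever sees plausibly-sized tokens. */
static void drop_long_runs(const char *in, char *out)
{
    const char *start = in;     /* start of the current alphanumeric run */
    size_t run = 0;             /* length of that run */
    char *o = out;

    for (;; in++) {
        if (*in != '\0' && isalnum((unsigned char)*in)) {
            if (run == 0)
                start = in;     /* remember where the run began */
            run++;
            continue;
        }
        /* run ended: keep it only if it could be a real token */
        if (run > 0 && run <= MAXTOKENLEN) {
            memcpy(o, start, run);
            o += run;
        }
        run = 0;
        if (*in == '\0')
            break;
        *o++ = *in;             /* pass non-alphanumeric bytes through */
    }
    *o = '\0';
}
```

With this in place, Greg's 600,000-x message would collapse to almost nothing 
before flex runs, while normal words (all under 30 characters) pass through 
untouched.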

Just an idea for a quick and simple solution ...  What do y'all 
think?  Good idea or bad?

David





More information about the bogofilter-dev mailing list