length vs speed

Sat Feb 22 14:10:08 CET 2003

On 20030221 (Fri) at 2357:46 -0500, David Relson wrote:
> The long 
> message had 600,000 x's in the same form and took 150 seconds.  The time 
> appears to be spent deep inside of flex's pattern matching code.
> 
> We also know that real words don't contain thousands of consecutive letters 
> and that bogofilter ultimately tosses any token longer than MAXTOKENLEN, 
> which is currently 30.

Tosses??  What about truncating instead?  That way
pseudoantidisestablishmentarianism can still be counted ;)  (Might be
important in languages like German, where combos that stay separated in
English, like "atom bomb fallout survival shelter," get built as single
words.)
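
To make the chop-versus-chuck difference concrete, here is a minimal
sketch in C.  Only MAXTOKENLEN and its value of 30 come from the message
quoted above; keep_or_toss() and chop_token() are hypothetical names made
up for illustration, not functions taken from bogofilter's source.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAXTOKENLEN 30  /* value quoted in this thread */

/* Hypothetical "chuck" policy: returns 1 to keep the token,
 * 0 to discard it because it is longer than MAXTOKENLEN. */
static int keep_or_toss(const char *tok)
{
    return strlen(tok) <= MAXTOKENLEN;
}

/* Hypothetical "chop" policy: copies at most MAXTOKENLEN bytes of tok
 * into out (always NUL-terminated) and returns the length stored.
 * This counts bytes, so a multi-byte character could be cut in half. */
static size_t chop_token(const char *tok, char *out, size_t outsize)
{
    size_t n = strlen(tok);

    assert(outsize > 0);
    if (n > MAXTOKENLEN)
        n = MAXTOKENLEN;
    if (n >= outsize)
        n = outsize - 1;
    memcpy(out, tok, n);
    out[n] = '\0';
    return n;
}
```

Under the chop policy the 34-letter word above would be counted as its
first 30 characters, "pseudoantidisestablishmentaria", where the chuck
policy would drop it altogether.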

> Since we know that bogofilter is going to pitch really long tokens, why not 
> add a special check in bogofilter's input routine and pitch excessively 
> long alphanumerics?  The check would be after saving text for the 
> passthrough option and after any needed decoding (base64, qp, etc) ...

Seems attractive at first glance -- again, chop rather than chuck would
be my preference, but maybe it's not worth it -- what do other people
think?
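
One way the proposed check could look, as a rough sketch rather than
actual bogofilter code: a single pass over the already-decoded buffer
that caps each run of alphanumerics before the lexer sees it.
limit_alnum_runs() and its maxrun parameter are made-up names.  This is
the chop variant; a chuck variant would drop the whole run instead of
keeping its first maxrun characters.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical pre-lexer filter: shortens every run of alphanumeric
 * characters in buf[0..len) to at most maxrun characters, in place.
 * Returns the new length.  The result is not NUL-terminated here. */
static size_t limit_alnum_runs(char *buf, size_t len, size_t maxrun)
{
    size_t in, out = 0, run = 0;

    assert(buf != NULL || len == 0);
    for (in = 0; in < len; in++) {
        unsigned char c = (unsigned char)buf[in];

        if (isalnum(c)) {
            if (++run > maxrun)
                continue;       /* skip the excess of an over-long run */
        } else {
            run = 0;            /* any other character ends the run */
        }
        buf[out++] = (char)c;
    }
    return out;
}
```

Called with maxrun set to MAXTOKENLEN, this would shrink the 600,000
x's to 30 before flex gets to them.  Whether that actually removes the
slowdown is an assumption that would need measuring.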

-- 
| G r e g  L o u i s          | gpg public key:      |
|   http://www.bgl.nu/~glouis |   finger greg at bgl.nu |
| Help free our mailboxes. Include                   |
|        http://wecanstopspam.org in your signature. |