length vs speed
Greg Louis
glouis at dynamicro.on.ca
Sat Feb 22 14:10:08 CET 2003
On 20030221 (Fri) at 2357:46 -0500, David Relson wrote:
> The long
> message had 600,000 x's in the same form and took 150 seconds. The time
> appears to be spent deep inside of flex's pattern matching code.
>
> We also know that real words don't contain thousands of consecutive letters
> and that bogofilter ultimately tosses any token longer than MAXTOKENLEN,
> which is currently 30.
Tosses?? What about truncating instead? That way
pseudoantidisestablishmentarianism can still be counted ;) (Might be
important in languages like German, where combos that stay separated in
English, like "atom bomb fallout survival shelter," get built as single
words.)
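
To make the difference concrete, here is a minimal sketch of the two policies, not bogofilter's actual code. MAXTOKENLEN and its value of 30 come from the message quoted above; truncate_token() and keep_token() are hypothetical names made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAXTOKENLEN 30  /* value quoted in the thread */

/* Hypothetical "chop" policy: copy at most MAXTOKENLEN bytes of token
 * into dst, NUL-terminate, and return the stored length.
 * dst must have room for MAXTOKENLEN + 1 bytes. */
static size_t truncate_token(const char *token, char *dst, size_t dstsize)
{
    size_t len;

    assert(token != NULL && dst != NULL && dstsize > MAXTOKENLEN);
    len = strlen(token);
    if (len > MAXTOKENLEN)
        len = MAXTOKENLEN;
    memcpy(dst, token, len);
    dst[len] = '\0';
    return len;
}

/* Hypothetical "chuck" policy: nonzero if the token is short enough
 * to keep, zero if it would be discarded. */
static int keep_token(const char *token)
{
    return strlen(token) <= MAXTOKENLEN;
}
```

With chopping, an over-long word still contributes a (shortened) token; with chucking it contributes nothing. One cost of chopping is that distinct long words sharing the same first MAXTOKENLEN characters would be counted as the same token.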
> Since we know that bogofilter is going to pitch really long tokens, why not
> add a special check in bogofilter's input routine and pitch excessively
> long alphanumerics? The check would be after saving text for the
> passthrough option and after any needed decoding (base64, qp, etc) ...
Seems attractive at first glance -- again, chop rather than chuck would
be my preference, but maybe it's not worth it -- what do other people
think?
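
For discussion, a rough sketch of what such a pre-lexer check might look like, covering both the chop and the chuck variant. This is an assumption about the shape of the idea, not a patch against bogofilter's input routine: limit_alnum_runs(), its parameters, and the in-place buffer interface are all hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical pre-lexer filter: scan buf[0..len) once and limit every
 * run of consecutive alphanumerics longer than maxrun.
 *   chop != 0: keep the first maxrun characters of the run
 *   chop == 0: drop the whole run
 * The buffer is rewritten in place; the new length is returned. */
static size_t limit_alnum_runs(char *buf, size_t len, size_t maxrun, int chop)
{
    size_t in = 0, out = 0;

    assert(buf != NULL);
    while (in < len) {
        size_t start = in;
        size_t run;

        /* measure the alphanumeric run starting here (may be empty) */
        while (in < len && isalnum((unsigned char)buf[in]))
            in++;
        run = in - start;
        if (run > maxrun)
            run = chop ? maxrun : 0;

        /* out never exceeds start, so moving left is always safe */
        memmove(buf + out, buf + start, run);
        out += run;

        /* copy the single non-alphanumeric byte that ended the run */
        if (in < len)
            buf[out++] = buf[in++];
    }
    return out;
}
```

The scan is a single linear pass, so a 600,000-character run would be reduced to at most maxrun characters before the lexer sees it. Whether that actually removes the slowdown would have to be measured.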
--
| G r e g L o u i s | gpg public key: |
| http://www.bgl.nu/~glouis | finger greg at bgl.nu |
| Help free our mailboxes. Include |
| http://wecanstopspam.org in your signature. |