MAXWORDLEN [was: [PATCH] new patch to lexer.l, ignore my previous patch]

Wed Oct 23 18:21:55 CEST 2002

David Relson wrote:

> At 11:14 AM 10/23/02, Allyn Fratkin wrote:
>
> > i see that Matthias (or someone) patched lexer.l to have stricter
> > rules on what comprises a short line of BASE64.  the new method will let
> > through base64 lines shorter than 32 characters that don't end in =
> > padding.
> > that's fine by me, particularly if good+spam=1 tokens will expire
> > eventually.
>
>
> Shouldn't we change "#define MAXWORDLEN 20" to "#define MAXWORDLEN 32"
> or something?
>
> With the current value, all tokens longer than 20 characters are
> discarded ...

not necessarily.  base64 data can look like multiple words because it
uses the characters + and / in addition to digits and letters.
the regular expression in lexer.l tries to recognize an entire line
to throw away, not a single word, and it does so without any context.
a line of 32+ base64 characters would be pretty uncommon in normal text,
so it is virtually guaranteed to be base64, even though it probably
consists of several shorter words.
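
to make the line-level reasoning above concrete, here is a minimal sketch
in C.  it is not the actual lexer.l rule; the names BASE64_LINE_MIN,
is_base64_char and looks_like_base64_line are hypothetical, and the
threshold of 32 is simply the figure mentioned in this thread.  it treats
a line as base64 if it is made up only of base64 characters and is either
at least 32 characters long or ends in = padding.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Assumed threshold from the discussion; not a real bogofilter constant. */
#define BASE64_LINE_MIN 32

/* True if c is in the standard base64 alphabet (padding '=' excluded). */
static bool is_base64_char(char c)
{
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
           (c >= '0' && c <= '9') || c == '+' || c == '/';
}

/* Decide whether a whole line (without its newline) looks like base64.
 * Long lines are discarded outright; short lines only if they end in
 * '=' padding, so that ordinary one-word lines survive. */
static bool looks_like_base64_line(const char *line)
{
    size_t len = strlen(line);
    size_t body = len;

    /* Strip at most two trailing '=' padding characters. */
    while (body > 0 && len - body < 2 && line[body - 1] == '=')
        body--;

    if (body == 0)
        return false;

    /* Every remaining character must be in the base64 alphabet. */
    for (size_t i = 0; i < body; i++)
        if (!is_base64_char(line[i]))
            return false;

    return len >= BASE64_LINE_MIN || body < len;
}
```

under these assumptions, looks_like_base64_line("allyn") is false (short,
no padding), while looks_like_base64_line("YWxseW4=") is true.  note that
the check never looks at individual words, so MAXWORDLEN does not enter
into it.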

i think the current solution is a good compromise between trying to
save normal words and trying to discard base64 (although i still
question how often a single word appears alone on a line; the only
common case i can think of is a first-name signature).

allyn
^ not base64 but could be
-- 
Allyn Fratkin             allyn at fratkin.com
Escondido, CA             http://www.fratkin.com/