lexer change

Boris 'pi' Piwinger 3.14 at logic.univie.ac.at
Wed Nov 12 10:12:34 CET 2003


Tom Anderson wrote:

>> I don't know who made it. But not allowing punctuation in
>> general seems reasonable for a word. "word" should give
>> 'word' not '"word"'. For the $-sign we could add it; I don't
>> have a strong opinion here, it seems that rule doesn't
>> change much anyway.
> 
> I'd disagree.  I don't think "word" is the correct term, but "token",
> which includes words as a subset. 

You are right, that was not meant to be strict.

> And a token should be allowed to be
> almost anything except for separators, which in most cases includes
> spaces of varying types, commas, semicolons, and maybe pipes.  Spaces
> are the most important, and could stand on their own.  I'd think of an
> email as a list of tokens delimited by separators.  We shouldn't be too
> concerned about what the tokens consist of, just whether or not the
> separators are encoded or plain-text.  I see no reason why '"word"'
> couldn't be a perfectly valid token that may even be strongly indicative
> of spam or not. 

Well, you lose a lot of words if you do allow punctuation
like quotes: "word" and '"word"' would be counted as
different tokens. For $ I see reasons to allow it
everywhere, though. For example, you often read $cientology
or Micro$oft; even though those are formally not words,
they could make good tokens.

> It's like arbitrarily splitting the word "plain-text"
> into two words... completely unnecessary and possibly detrimental.

Actually we allow for hyphens. I have another idea, similar
to what Google does: let's find tokens first (pretty
general), then remove all "special characters" (like -, ',
´, etc.) from each token. This would find the token
plain-text and generate plaintext for the wordlist. Why
could that be useful? Well, many people don't know that '
is the right character for an apostrophe and use ´ instead,
which yields different tokens for the same word. In German
many people use ' for genitives, which is simply wrong, so
the forms with and without the apostrophe are identified.
Sounds reasonable to me, and certainly Google has good
reasons to do it. On the other hand we lose some
information which might have been useful.
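
Here is a minimal C sketch of that second step (bogofilter
is written in C); normalize_token and the SPECIALS set are
made-up names for illustration, and charset-aware handling
of ´ is omitted:

  #include <string.h>

  /* Characters stripped when normalizing a token for the
   * wordlist; this set (hyphen, apostrophe, backtick) is
   * only illustrative. */
  static const char *SPECIALS = "-'`";

  /* Copy token src into dst, dropping special characters,
   * so "plain-text" becomes "plaintext". dst must hold at
   * least strlen(src)+1 bytes. */
  void normalize_token(const char *src, char *dst)
  {
      for (; *src; src++)
          if (strchr(SPECIALS, *src) == NULL)
              *dst++ = *src;
      *dst = '\0';
  }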

>> > TOKENFRONT	[^[:blank:][:cntrl:][:digit:][:punct:]]
>> > TOKENMID	[^[:blank:]<>;=():&%$#@+|/\\{}^\"?*,[:cntrl:][\]]*
>> > TOKENBACK	[^[:blank:]<>;=():&%$#@+|/\\{}^\"?*,[:cntrl:][\]._+-]
>> 
>> If I can trust my eyes (I usually cannot;-) those characters
>> are allowed to show up in the middle of a word, but not at
>> the beginning: !'-._`~ (which looks OK).
> 
> It seems overly arbitrary to me.  The set of separators should be very
> small, and a token should be anything between separators.

I don't see this. Those seven characters only show up in the
middle of real words (some not even there, but spammers are
known to use them anyway); I don't see them at the beginning
or end of a word.
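
That set can be checked mechanically. The following C
snippet (a throwaway sketch; MID_EXCLUDED just copies the
punctuation listed in TOKENMID) prints exactly the
punctuation TOKENMID admits, which should be the seven
characters above:

  #include <ctype.h>
  #include <stdio.h>
  #include <string.h>

  /* Punctuation excluded inside TOKENMID by the proposed
   * definitions. */
  static const char *MID_EXCLUDED = "<>;=():&%$#@+|/\\{}^\"?*,[]";

  int main(void)
  {
      /* TOKENFRONT rejects all of [:punct:], so whatever
       * remains here is allowed mid-token but not at the
       * front of one. */
      for (int c = 33; c < 127; c++)
          if (ispunct(c) && strchr(MID_EXCLUDED, c) == NULL)
              putchar(c);
      putchar('\n');
      return 0;
  }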

> I believe the dash (-) should be escaped too, since that is used in
> character ranges.

Not if it is at the beginning or end of a character class;
there the dash is taken literally and needs no escaping.
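
A quick way to check this with POSIX regcomp(3) -- just a
throwaway demonstration, not bogofilter code:

  #include <regex.h>
  #include <stdio.h>

  int main(void)
  {
      regex_t re;
      /* The dash is last in the bracket expression, so it is
       * taken literally and needs no escaping. */
      regcomp(&re, "[._+-]", REG_EXTENDED);
      printf("%s\n", regexec(&re, "-", 0, NULL, 0) == 0
             ? "dash matched literally" : "no match");
      regfree(&re);
      return 0;
  }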

>> At the end of a word we only allow !'` in addition to those
>> allowed at the front. I cannot say why ' or ` should be
>> there. I'd disallow those. And by your argument also remove
>> ! -- even though it "works".
> 
> By my argument, we shouldn't disallow anything.  E.g. "why?" or "class'"
> or "100%" or "[sic]", etc., may be common tokens.

I really dislike any punctuation in tokens. That multiplies
them without a good reason.

pi
