lexer change

Wed Nov 12 08:36:10 CET 2003

On Tue, 2003-11-11 at 09:39, Boris 'pi' Piwinger wrote:
> I don't know who made it. But not allowing punctuation in
> general seems reasonable for a word. "word" should give
> 'word' not '"word"'. For the $-sign we could add it, I don't
> have a strong opinion here, it seems that that rule doesn't
> change much anyway.

I disagree.  I don't think "word" is the right term; "token" is, and
tokens include words as a subset.  A token should be allowed to be
almost anything except separators, which in most cases means spaces of
various kinds, commas, semicolons, and maybe pipes.  Spaces are the most
important, and could stand on their own.  I'd think of an email as a
list of tokens delimited by separators.  We shouldn't be too concerned
about what the tokens consist of, just whether the separators are
encoded or plain-text.  I see no reason why '"word"' couldn't be a
perfectly valid token, one that may even be a strong indicator of spam
or non-spam.  Splitting it is like arbitrarily splitting the word
"plain-text" into two words... completely unnecessary and possibly
detrimental.
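
To make the "tokens delimited by separators" idea concrete, here is a
minimal sketch in Python rather than flex.  The separator set
(whitespace, commas, semicolons, pipes) is taken from the paragraph
above; the names SEPARATORS and tokenize are made up for illustration
and are not part of bogofilter.

```python
import re

# Assumed separator set: whitespace, commas, semicolons and pipes.
SEPARATORS = re.compile(r"[\s,;|]+")


def tokenize(text):
    """Split text on runs of separators and drop empty strings."""
    return [tok for tok in SEPARATORS.split(text) if tok]


print(tokenize('Buy "word" now; plain-text only, 100% free'))
# ['Buy', '"word"', 'now', 'plain-text', 'only', '100%', 'free']
```

Note that '"word"', "plain-text" and "100%" all survive as single
tokens, because nothing other than the separators is treated specially.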

If certain tokens show up only very rarely and thus make for an
inefficient database, then the optimization should be to purge the
database periodically rather than to try to anticipate in advance which
tokens these will be.  This way, it will still work as frequencies
evolve over time.
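
A periodic purge along these lines could look roughly like the sketch
below.  Everything in it is an assumption for illustration: the
in-memory dict, the (count, last_seen_day) layout, the function name
purge, and the thresholds.  It does not describe bogofilter's actual
database format.

```python
def purge(db, today, min_count=3, max_age_days=30):
    """Drop tokens that are both rare and not seen recently.

    db maps token -> (count, last_seen_day); days are plain integers.
    """
    return {
        tok: (count, last_seen)
        for tok, (count, last_seen) in db.items()
        if count >= min_count or today - last_seen <= max_age_days
    }


db = {'"word"': (40, 100), "xqzv": (1, 10), "fresh": (1, 99)}
print(purge(db, today=100))
# {'"word"': (40, 100), 'fresh': (1, 99)}
```

Keeping rare tokens that were seen recently gives new tokens a chance to
accumulate counts before the next purge, which is what lets the database
follow changing frequencies.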

> > TOKENFRONT	[^[:blank:][:cntrl:][:digit:][:punct:]]
> > TOKENMID	[^[:blank:]<>;=():&%$#@+|/\\{}^\"?*,[:cntrl:][\]]*
> > TOKENBACK	[^[:blank:]<>;=():&%$#@+|/\\{}^\"?*,[:cntrl:][\]._+-]
> 
> If I can trust my eyes (I usually cannot;-) those characters
> are allowed to show up in the middle of a word, but not at
> the beginning: !'-._`~ (which looks OK).

It seems overly arbitrary to me.  The set of separators should be very
small, and a token should be anything between separators.
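
For reference, the punctuation each quoted rule lets through can be
worked out mechanically instead of by eye.  The sketch below assumes
that [:punct:] means the 32 ASCII punctuation characters (as in the C
locale, which matches Python's string.punctuation) and copies the
excluded characters by hand from the TOKENMID and TOKENBACK definitions
quoted above.

```python
import string

# The 32 ASCII punctuation characters, standing in for [:punct:].
PUNCT = set(string.punctuation)

# Punctuation excluded by TOKENMID, transcribed from the quoted rule.
MID_EXCLUDED = set('<>;=():&%$#@+|/\\{}^"?*,[]')

# TOKENBACK excludes the same characters plus . _ + and -
BACK_EXCLUDED = MID_EXCLUDED | set("._+-")

mid_allowed = PUNCT - MID_EXCLUDED
back_allowed = PUNCT - BACK_EXCLUDED

print("".join(sorted(mid_allowed)))   # !'-._`~
print("".join(sorted(back_allowed)))  # !'`~
```

Under this reading, seven punctuation characters can appear inside a
token and four at its end, and TOKENFRONT accepts none of them at the
start.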

> BTW: There is a + doubled in TOKENBACK.

I believe the dash (-) should be escaped too, since that is used in
character ranges.
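
The range hazard is easy to demonstrate, here with Python's re module
rather than flex.  The three classes below are made-up examples, not the
bogofilter rules.  In Python a dash that comes last in a character
class, as it does in TOKENBACK, is taken literally; as far as I know
flex behaves the same way, but that is worth checking against the flex
manual.  Escaping the dash removes any dependence on its position.

```python
import re

# Dash between two characters: a range from '+' (0x2B) to '.' (0x2E),
# which silently includes ',' (0x2C) and '-' (0x2D).
range_class = re.compile(r"[+-.]")

# Dash in last position: taken literally.
trailing_dash = re.compile(r"[.+-]")

# Dash escaped: literal regardless of where it sits in the class.
escaped_dash = re.compile(r"[+\-.]")

print(bool(range_class.fullmatch(",")))    # True
print(bool(trailing_dash.fullmatch(",")))  # False
print(bool(escaped_dash.fullmatch(",")))   # False
```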

> At the end of a word we only allow !'` in addition to those
> allowed at the front. I cannot say why ' or ` should be
> there. I'd disallow those. And by your argument also remove
> ! -- even though it "works".

By my argument, we shouldn't disallow anything.  E.g. "why?" or "class'"
or "100%" or "[sic]", etc., may be common tokens.

Tom
