Understanding lexer_v3.l changes

Sun Nov 26 18:53:06 CET 2006

On Sun, 26 Nov 2006 18:28:49 +0100
Boris 'pi' Piwinger wrote:

> David Relson <relson at osagesoftware.com> wrote:
> 
...[snip]...

> >It allows dots within IDs.

ID formats vary between mail programs.  Allowing dots increases the
set of acceptable IDs.  As we know, enlarging the set of tokens is
sometimes useful and sometimes unnecessary.

> Obviously, it does, but why? It used to work without the
> dots for ages.

...[snip]...

> >As best I can tell, your suggestions add up to "SHORT_TOKEN isn't
> >necessary. A few changes to TOKEN can eliminate it."  Even if that's
> >not exactly what you're thinking, I've eliminated SHORT_TOKEN without
> >breaking "make check".
> 
> Actually, this is exactly what I thought would be possible.
> 
> In my version in addition I made TOKENFRONT, TOKENMID and
> TOKENBACK the same. This *does* *change* the meaning, but
> works perfectly for me while reducing complexity.

That opens up a whole can of worms.  What characters should be allowed
at the beginning of a token?  In the middle?  At the back?  For example,
where should apostrophes be allowed?  Bogofilter allows "it's" but
interprets "'tis" as "tis".  Allowing apostrophes anywhere makes '''''
a valid token, which I don't want.  Does it really matter?  To be
honest, I doubt it.
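
The front/middle/back distinction can be illustrated with a small
regex sketch.  The three character classes below are assumptions made
up for this example; they are not the TOKENFRONT, TOKENMID and
TOKENBACK definitions from lexer_v3.l, and Python's re module stands in
for flex.

```python
import re

# Hypothetical character classes, chosen only for illustration.
# They are NOT the definitions used in lexer_v3.l.
FRONT = r"[A-Za-z0-9$]"
MID = r"[A-Za-z0-9$'.-]"
BACK = r"[A-Za-z0-9]"

# Distinct classes: an apostrophe is legal only inside a token.
SPLIT_TOKEN = re.compile(f"{FRONT}(?:{MID}*{BACK})?")

# Unified class: every MID character may appear in any position.
FLAT_TOKEN = re.compile(f"{MID}+")


def tokens(pattern, text):
    """Return all non-overlapping matches of pattern in text."""
    return pattern.findall(text)


if __name__ == "__main__":
    sample = "it's 'tis '''''"
    print(tokens(SPLIT_TOKEN, sample))
    print(tokens(FLAT_TOKEN, sample))
```

With separate classes the sample yields "it's" and "tis"; with one
shared class it yields "it's", "'tis" and "'''''", which is the
behaviour described above.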

...[snip]...
> >
> >File token.c had some special processing to allow 2 character money
> >tokens, i.e. "$1", "$2", etc.  The MONEY code allows a cleaner
> >implementation of this special case.
> 
> I see. I had removed this clause and don't allow $ in TOKEN
> at all. Maybe it should be retested if the additional
> complexity of currency handling does add any benefit.

Allowing money amounts does matter.  Here are my scores for
single-digit dollar amounts:

     spam  good    Fisher
$1   9434  1176  0.719198
$2   4778   626  0.709037
$3   7751   409  0.858166
$4   4543   182  0.888510
$5  19691   524  0.923063
$6   2912   115  0.889920
$7   8135   164  0.940606
$8   3035   118  0.891441
$9   8085   150  0.945080
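
For readers unfamiliar with how a per-token score of this kind is
derived from spam and good counts, here is a minimal sketch of a
Robinson-style smoothed token probability.  It is an illustration, not
bogofilter's code: the function name, the smoothing values s and x, and
the message totals are all assumptions, and the message totals behind
the table above are not given in this post.

```python
def token_score(spam, good, spam_msgs, good_msgs, s=1.0, x=0.5):
    """Smoothed probability that a message containing the token is spam.

    spam, good           -- times the token was seen in spam / good mail
    spam_msgs, good_msgs -- total spam / good messages (must be > 0)
    s, x                 -- smoothing strength and prior for unseen
                            tokens (placeholder values, not taken from
                            any bogofilter configuration)
    """
    n = spam + good
    if n == 0:
        # No evidence at all: fall back to the prior.
        return x
    spam_rate = spam / spam_msgs
    good_rate = good / good_msgs
    # Raw estimate, normalised for the sizes of the two corpora.
    p = spam_rate / (spam_rate + good_rate)
    # Pull the raw estimate toward x when the token is rare.
    return (s * x + n * p) / (s + n)


if __name__ == "__main__":
    # Counts from the table; the message totals are made up.
    for name, spam, good in [("$1", 9434, 1176), ("$5", 19691, 524)]:
        print(name, round(token_score(spam, good, 100000, 100000), 6))
```

Because the message totals here are invented, the printed values will
not reproduce the table; the sketch only shows why a token such as $5,
seen far more often in spam than in good mail, ends up with a score
well above 0.5.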

> >I've attached a patch file with the changes from 1.1.1 to current cvs
> >for lexer_v3.l and token.c.  If you have further improvements (that
> >don't break "make check"), I'm all ears.
> 
> Should be most. The other ones will change the meaning which
> may or may not be beneficial.

'Tis hard to say :-<

> You might want to try `diff -b -B` against my lexer for more
> ideas.
> 
> What I do see (no idea when we went different directions) is
> <HTOKEN>({TOKEN}) vs. <HTOKEN>{TOKEN} and the same without
> <HTOKEN>. Does this make any difference? If not I would go
> for the simpler version. 

The parens don't have any effect I can detect, and have been removed.

Regards,

David