Understanding lexer_v3.l changes
David Relson
relson at osagesoftware.com
Sun Nov 26 20:03:32 CET 2006
On Sun, 26 Nov 2006 19:39:57 +0100
Boris 'pi' Piwinger wrote:
> David Relson <relson at osagesoftware.com> wrote:
>
> >> >It allows dots within IDs.
> >
> >ID formats vary between mail programs. Allowing dots increases the
> >set of acceptable IDs. As we know, increasing the set of tokens can
> >be both useful and unnecessary.
>
> Right. AFAICS only one rule uses it:
> <INITIAL>\n?[[:blank:]]id{WHITESPACE}+{ID} { return QUEUE_ID; }
>
> The \n? does not seem to have any function. But I do see
> dots in IDs, so I will also add it.
Not certain, but it's likely there to handle folded (continuation)
header lines.
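As a rough illustration, the rule quoted above can be approximated in Python regex form. This is a hypothetical simplification, not bogofilter's actual code: `QUEUE_ID_RE`, `find_queue_id`, the sample header, and the ID class `[A-Za-z0-9][A-Za-z0-9.]*` (letters and digits, with dots allowed after the first character) are all illustrative assumptions.

```python
import re

# Hypothetical sketch of the flex rule discussed above:
#   <INITIAL>\n?[[:blank:]]id{WHITESPACE}+{ID}   { return QUEUE_ID; }
# with an ID definition that permits dots after the first character.
QUEUE_ID_RE = re.compile(r"\n?[ \t]id[ \t]+([A-Za-z0-9][A-Za-z0-9.]*)")

def find_queue_id(text):
    """Return the queue id following an 'id' keyword, or None."""
    m = QUEUE_ID_RE.search(text)
    return m.group(1) if m else None

# A Received:-style fragment; dotted ids like this motivated allowing '.'.
header = " id 8C2B4.1A3F0; Sun, 26 Nov 2006 19:39:57 +0100"
print(find_queue_id(header))  # → 8C2B4.1A3F0
```

Without the dot in the ID class, the same input would yield only the part before the first dot, which is why widening the character class increases the set of acceptable IDs.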
>
> >> >As best I can tell, your suggestions add up to "SHORT_TOKEN isn't
> >> >necessary. A few changes to TOKEN can eliminate it." Even if
> >> >that's not exactly what you're thinking, I've eliminated
> >> >SHORT_TOKEN without breaking "make check".
> >>
> >> Actually, this is exactly what I thought would be possible.
> >>
> >> In my version in addition I made TOKENFRONT, TOKENMID and
> >> TOKENBACK the same. This *does* *change* the meaning, but
> >> works perfectly for me while reducing complexity.
> >
> >That opens up a whole can of worms. What characters should be
> >allowed at the beginning of a token? In the middle? At the back?
> >For example where should apostrophes be allowed? Bogofilter allows
> >"it's" but interprets "'tis" as "tis". Allowing apostrophes
> >anywhere makes ''''' a valid token, which I don't want. Does it
> >really matter? To be honest, I doubt it.
>
> You are right, it will create totally counterintuitive
> tokens. In the end it is statistics. AAMOF I do not
> allow ', but this is probably just taste. This will
> reduce English genitives to their core, which might help or
> not. I won't get things like "won't", "it's" etc., but those
> are most likely not significant anyway.
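A toy example may make the TOKENFRONT/TOKENMID/TOKENBACK distinction above concrete. This is not bogofilter's actual grammar; the pattern and `tokens` helper are illustrative assumptions in which apostrophes are legal only in the middle of a token.

```python
import re

# Toy sketch (not bogofilter's real lexer): apostrophes allowed only
# between letters, so "it's" survives, a leading apostrophe is stripped,
# and a run of bare apostrophes produces no token at all.
TOKEN_RE = re.compile(r"[a-z]+(?:'[a-z]+)*")

def tokens(text):
    return TOKEN_RE.findall(text.lower())

print(tokens("it's"))    # → ["it's"]
print(tokens("'tis"))    # → ['tis']
print(tokens("'''''"))   # → []
```

Collapsing front, middle, and back into one character class would instead admit ''''' as a token, which is exactly the case objected to above.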
>
> >> >File token.c had some special processing to allow 2 character
> >> >money tokens, i.e. "$1", "$2", etc. The MONEY code allows a
> >> >cleaner implementation of this special case.
> >>
> >> I see. I had removed this clause and don't allow $ in TOKEN
> >> at all. Maybe it should be retested if the additional
> >> complexity of currency handling does add any benefit.
> >
> >Allowing money amounts does matter. Here are my scores for single
> >digit dollar amounts:
> >
> >        spam   good  Fisher
> >$1      9434   1176  0.719198
> >$2      4778    626  0.709037
> >$3      7751    409  0.858166
> >$4      4543    182  0.888510
> >$5     19691    524  0.923063
> >$6      2912    115  0.889920
> >$7      8135    164  0.940606
> >$8      3035    118  0.891441
> >$9      8085    150  0.945080
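The single-digit dollar tokens in the table above could be produced by a MONEY rule along the lines of this sketch. The regex is a hypothetical simplification, not the actual flex pattern: a currency sign followed by digits (optionally with separators) is kept as one token, so "$1" and "$5" survive lexing instead of losing the "$".

```python
import re

# Hypothetical simplification of the MONEY handling discussed above.
MONEY_RE = re.compile(r"\$\d+(?:[.,]\d+)*")

print(MONEY_RE.findall("Send $5 now, or 3 payments of $1.99"))
# → ['$5', '$1.99']
```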
>
> Yes, the tokens are significant, but would classification be
> worse if in training they would have been ignored?
Probably not. Any small group of tokens can be removed from your
wordlist without much effect.
People who ran bogofilter after a major database problem have shown
that even after removing any _one_ of the following token groups:
tokens beginning with lower case letters
tokens beginning with upper case letters
tokens beginning with a-m
tokens beginning with n-z
bogofilter still functions well. Of course, it works better when
they're _all_ present.
If you remember, early on bogofilter converted upper case to lower
case. Disabling that increased the wordlist size and bogofilter's
accuracy.