Understanding lexer_v3.l changes

Boris 'pi' Piwinger 3.14 at piology.org
Sun Nov 26 19:39:57 CET 2006


David Relson <relson at osagesoftware.com> wrote:

>> >It allows dots within IDs.
>
>ID formats vary between mail programs.  Allowing dots increases the
>set of acceptable IDs.  As we know, increasing the set of tokens can be
>both useful and unnecessary.

Right. AFAICS only one rule uses it:
<INITIAL>\n?[[:blank:]]id{WHITESPACE}+{ID}      { return QUEUE_ID; }

The \n? does not seem to serve any function. But I do see
dots in IDs, so I will allow them as well.
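
Just to make that concrete, one way to allow dots is to add
the dot to the ID character class, e.g. (only a sketch, the
actual definition in lexer_v3.l may differ):

  ID      [[:alnum:].-]+

so that a queue ID containing a dot, say something like
1Gx4qT.4711 (a made-up example), still matches the QUEUE_ID
rule above.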

>> >As best I can tell, your suggestions add up to "SHORT_TOKEN isn't
>> >necessary. A few changes to TOKEN can eliminate it."  Even if that's
>> >not exactly what you're thinking, I've eliminated SHORT_TOKEN without
>> >breaking "make check".
>> 
>> Actually, this is exactly what I thought would be possible.
>> 
>> In my version in addition I made TOKENFRONT, TOKENMID and
>> TOKENBACK the same. This *does* *change* the meaning, but
>> works perfectly for me while reducing complexity.
>
>That opens up a whole can of worms.  What characters should be allowed
>at the beginning of a token?  In the middle?  At the back?  For example
>where should apostrophes be allowed?  Bogofilter allows "it's" but
>interprets "'tis" as "tis".  Allowing apostrophes anywhere makes '''''
>a valid token, which I don't want.  Does it really matter?  To be
>honest, I doubt it.

You are right, it will create totally counterintuitive
tokens. In the end it is statistics. AAMOF I do not allow '
at all, but this is probably just a matter of taste. It
reduces English genitives to their stem, which might or
might not help. Tokens like "won't", "it's" etc. won't be
added, but those are most likely not significant anyway.
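
For illustration, the front/mid/back split versus a single
class could look roughly like this in flex (the character
classes here are made up; the real ones in lexer_v3.l
differ):

  TOKENFRONT  [[:alpha:]]
  TOKENMID    [[:alnum:]'._-]*
  TOKENBACK   [[:alnum:]]
  TOKEN       {TOKENFRONT}{TOKENMID}{TOKENBACK}

versus

  TOKENCHAR   [[:alnum:]._-]
  TOKEN       {TOKENCHAR}+

The first form forbids ' (and dots etc.) at the edges, so
"it's" survives while "'tis" becomes "tis"; the second
treats every position the same, and with ' left out of the
class entirely, "it's" splits into "it" and "s" while
"'tis" still becomes "tis".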

>> >File token.c had some special processing to allow 2 character money
>> >tokens, i.e. "$1", "$2", etc.  The MONEY code allows a cleaner
>> >implementation of this special case.
>> 
>> I see. I had removed this clause and don't allow $ in TOKEN
>> at all. Maybe it should be retested if the additional
>> complexity of currency handling does add any benefit.
>
>Allowing money amounts does matter.  Here are my scores for single
>digit dollar amounts:
>
>     spam  good    Fisher
>$1   9434  1176  0.719198
>$2   4778   626  0.709037
>$3   7751   409  0.858166
>$4   4543   182  0.888510
>$5  19691   524  0.923063
>$6   2912   115  0.889920
>$7   8135   164  0.940606
>$8   3035   118  0.891441
>$9   8085   150  0.945080

Yes, the tokens are significant, but would classification be
worse if they had been ignored during training?
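
For anyone curious what such a MONEY rule might look like, a
minimal flex sketch could be (the actual pattern and action
in lexer_v3.l and token.c may well differ):

  MONEY   \$[0-9]+(\.[0-9][0-9])?

  {MONEY}   { return TOKEN; }  /* or whatever class the lexer returns for money */

which matches $1 as well as $19.95 as a single token and
removes the need to special-case two-character dollar
amounts in token.c.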

pi


