Understanding lexer_v3.l changes
Boris 'pi' Piwinger
3.14 at piology.org
Sun Nov 26 19:39:57 CET 2006
David Relson <relson at osagesoftware.com> wrote:
>> >It allows dots within IDs.
>
>ID formats vary between mail programs. Allowing dots increases the
>set of acceptable IDs. As we know, increasing the set of tokens can be
>both useful and unnecessary.
Right. AFAICS only one rule uses it:
<INITIAL>\n?[[:blank:]]id{WHITESPACE}+{ID} { return QUEUE_ID; }
The \n? does not seem to serve any function. But I do see
dots in IDs, so I will add that as well.
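To make the discussion concrete, here is a rough Python sketch of what that QUEUE_ID rule matches once dots are allowed inside {ID}. The character classes are assumptions for illustration; the actual {ID} and {WHITESPACE} definitions in lexer_v3.l may differ.

```python
import re

# Illustrative stand-in for the flex rule
#   <INITIAL>\n?[[:blank:]]id{WHITESPACE}+{ID}
# assuming {ID} is alphanumerics with optional embedded dots
# and {WHITESPACE} is spaces/tabs. Not the real lexer definitions.
QUEUE_ID = re.compile(
    r"\n?[ \t]id[ \t]+(?P<id>[A-Za-z0-9]+(?:\.[A-Za-z0-9]+)*)"
)

line = "\tid ABC123.DEF456"
m = QUEUE_ID.match(line)
print(m.group("id"))  # -> ABC123.DEF456
```

Without the `(?:\.[A-Za-z0-9]+)*` part, a dotted queue ID would be cut off at the first dot, which is the behavior change being discussed.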
>> >As best I can tell, your suggestions add up to "SHORT_TOKEN isn't
>> >necessary. A few changes to TOKEN can eliminate it." Even if that's
>> >not exactly what you're thinking, I've eliminated SHORT_TOKEN without
>> >breaking "make check".
>>
>> Actually, this is exactly what I thought would be possible.
>>
>> In addition, in my version I made TOKENFRONT, TOKENMID,
>> and TOKENBACK the same. This *does* *change* the meaning,
>> but works perfectly for me while reducing complexity.
>
>That opens up a whole can of worms. What characters should be allowed
>at the beginning of a token? In the middle? At the back? For example
>where should apostrophes be allowed? Bogofilter allows "it's" but
>interprets "'tis" as "tis". Allowing apostrophes anywhere makes '''''
>a valid token, which I don't want. Does it really matter? To be
>honest, I doubt it.
You are right, it will create totally counterintuitive
tokens. In the end it is statistics. AAMOF I do not allow '
at all, but this is probably just taste. This will reduce
English genitives to their core, which might help or not. I
won't get tokens like "won't", "it's" etc., but those are
most likely not significant anyway.
>> >File token.c had some special processing to allow 2 character money
>> >tokens, i.e. "$1", "$2", etc. The MONEY code allows a cleaner
>> >implementation of this special case.
>>
>> I see. I had removed this clause and don't allow $ in TOKEN
>> at all. Maybe it should be retested whether the additional
>> complexity of currency handling adds any benefit.
>
>Allowing money amounts does matter. Here are my scores for single
>digit dollar amounts:
>
>        spam   good    Fisher
>$1      9434   1176  0.719198
>$2      4778    626  0.709037
>$3      7751    409  0.858166
>$4      4543    182  0.888510
>$5     19691    524  0.923063
>$6      2912    115  0.889920
>$7      8135    164  0.940606
>$8      3035    118  0.891441
>$9      8085    150  0.945080
Yes, the tokens are significant, but would classification be
worse if they had been ignored during training?
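For readers following along, a hedged sketch of how a per-token score like the Fisher column might come about. This follows the Robinson-style smoothing bogofilter exposes via its robs/robx parameters, but the corpus totals below are made up, so the numbers will not reproduce the table above.

```python
# Robinson-style token spamicity: p(w) normalizes the raw counts by
# corpus size, then f(w) shrinks p(w) toward a prior x for tokens
# with few observations. Parameter values are illustrative only.
def spamicity(bad, good, bad_total, good_total, s=0.01, x=0.5):
    # p(w): frequency-normalized spam probability of the token
    p = (bad / bad_total) / (bad / bad_total + good / good_total)
    n = bad + good
    # f(w): weighted average of the prior x and the observed p(w)
    return (s * x + n * p) / (s + n)

# Hypothetical corpus sizes -- not the poster's actual training data.
print(round(spamicity(9434, 1176, bad_total=100000, good_total=50000), 3))
```

The point of the question above is visible here: a token's influence comes from its counts relative to the whole corpus, so dropping a token from training changes not just its own score but the combined score of every message it appears in.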
pi
More information about the Bogofilter mailing list