Understanding lexer_v3.l changes
David Relson
relson at osagesoftware.com
Sun Nov 26 20:03:32 CET 2006
On Sun, 26 Nov 2006 19:39:57 +0100
Boris 'pi' Piwinger wrote:
> David Relson <relson at osagesoftware.com> wrote:
>
> >> >It allows dots within IDs.
> >
> >ID formats vary between mail programs. Allowing dots increases the
> >set of acceptable IDs. As we know, increasing the set of tokens can
> >be both useful and unnecessary.
>
> Right. AFAICS only one rule uses it:
> <INITIAL>\n?[[:blank:]]id{WHITESPACE}+{ID} { return QUEUE_ID; }
>
> The \n? does not seem to have any function. But I do see
> dots in IDs, so I will also add it.
Not certain, but it's likely there to handle folded (continuation)
header lines.
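As a rough illustration, the rule quoted above can be approximated in Python regex form. This is a hypothetical simplification, not bogofilter's actual code: `QUEUE_ID_RE`, `find_queue_id`, the sample header, and the ID class `[A-Za-z0-9][A-Za-z0-9.]*` (letters and digits, with dots allowed after the first character) are all illustrative assumptions.

```python
import re

# Hypothetical sketch of the flex rule discussed above:
#   <INITIAL>\n?[[:blank:]]id{WHITESPACE}+{ID}   { return QUEUE_ID; }
# with an ID definition that permits dots after the first character.
QUEUE_ID_RE = re.compile(r"\n?[ \t]id[ \t]+([A-Za-z0-9][A-Za-z0-9.]*)")

def find_queue_id(text):
    """Return the queue id following an 'id' keyword, or None."""
    m = QUEUE_ID_RE.search(text)
    return m.group(1) if m else None

# A Received:-style fragment; dotted ids like this motivated allowing '.'.
header = " id 8C2B4.1A3F0; Sun, 26 Nov 2006 19:39:57 +0100"
print(find_queue_id(header))  # → 8C2B4.1A3F0
```

Without the dot in the ID class, the same input would yield only the part before the first dot, which is why widening the character class increases the set of acceptable IDs.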
>
> >> >As best I can tell, your suggestions add up to "SHORT_TOKEN isn't
> >> >necessary. A few changes to TOKEN can eliminate it." Even if
> >> >that's not exactly what you're thinking, I've eliminated
> >> >SHORT_TOKEN without breaking "make check".
> >>
> >> Actually, this is exactly what I thought would be possible.
> >>
> >> In my version in addition I made TOKENFRONT, TOKENMID and
> >> TOKENBACK the same. This *does* *change* the meaning, but
> >> works perfectly for me while reducing complexity.
> >
> >That opens up a whole can of worms. What characters should be
> >allowed at the beginning of a token? In the middle? At the back?
> >For example where should apostrophes be allowed? Bogofilter allows
> >"it's" but interprets "'tis" as "tis". Allowing apostrophes
> >anywhere makes ''''' a valid token, which I don't want. Does it
> >really matter? To be honest, I doubt it.
>
> You are right, it will create totally counterintuitive
> tokens. In the end it is statistics. AAMOF I do not
> allow ', but this is probably just taste. This will
> reduce English genitives to their core, which might help or
> not. I won't get things like "won't", "it's" etc., but those
> are most likely not significant anyway.
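A toy example may make the TOKENFRONT/TOKENMID/TOKENBACK distinction above concrete. This is not bogofilter's actual grammar; the pattern and `tokens` helper are illustrative assumptions in which apostrophes are legal only in the middle of a token.

```python
import re

# Toy sketch (not bogofilter's real lexer): apostrophes allowed only
# between letters, so "it's" survives, a leading apostrophe is stripped,
# and a run of bare apostrophes produces no token at all.
TOKEN_RE = re.compile(r"[a-z]+(?:'[a-z]+)*")

def tokens(text):
    return TOKEN_RE.findall(text.lower())

print(tokens("it's"))    # → ["it's"]
print(tokens("'tis"))    # → ['tis']
print(tokens("'''''"))   # → []
```

Collapsing front, middle, and back into one character class would instead admit ''''' as a token, which is exactly the case objected to above.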
>
> >> >File token.c had some special processing to allow 2 character
> >> >money tokens, i.e. "$1", "$2", etc. The MONEY code allows a
> >> >cleaner implementation of this special case.
> >>
> >> I see. I had removed this clause and don't allow $ in TOKEN
> >> at all. Maybe it should be retested if the additional
> >> complexity of currency handling does add any benefit.
> >
> >Allowing money amounts does matter. Here are my scores for single
> >digit dollar amounts:
> >
> >        spam   good  Fisher
> >$1      9434   1176  0.719198
> >$2      4778    626  0.709037
> >$3      7751    409  0.858166
> >$4      4543    182  0.888510
> >$5     19691    524  0.923063
> >$6      2912    115  0.889920
> >$7      8135    164  0.940606
> >$8      3035    118  0.891441
> >$9      8085    150  0.945080
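The single-digit dollar tokens in the table above could be produced by a MONEY rule along the lines of this sketch. The regex is a hypothetical simplification, not the actual flex pattern: a currency sign followed by digits (optionally with separators) is kept as one token, so "$1" and "$5" survive lexing instead of losing the "$".

```python
import re

# Hypothetical simplification of the MONEY handling discussed above.
MONEY_RE = re.compile(r"\$\d+(?:[.,]\d+)*")

print(MONEY_RE.findall("Send $5 now, or 3 payments of $1.99"))
# → ['$5', '$1.99']
```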
>
> Yes, the tokens are significant, but would classification be
> worse if in training they would have been ignored?
Probably not. Any small group of tokens can be removed from your
wordlist without much effect.
People who ran bogofilter after a major database problem have shown
that even after removing any _one_ of the following token groups:
tokens beginning with lower case letters
tokens beginning with upper case letters
tokens beginning with a-m
tokens beginning with n-z
bogofilter still functions well. Of course, it works better when
they're _all_ present.
If you remember, early on bogofilter converted upper case to lower
case. Disabling that increased the wordlist size and bogofilter's
accuracy.