Understanding lexer_v3.l changes
David Relson
relson at osagesoftware.com
Sun Nov 26 18:53:06 CET 2006
On Sun, 26 Nov 2006 18:28:49 +0100
Boris 'pi' Piwinger wrote:
> David Relson <relson at osagesoftware.com> wrote:
>
...[snip]...
> >It allows dots within IDs.
ID formats vary between mail programs. Allowing dots increases the
set of acceptable IDs. As we know, increasing the set of tokens can be
both useful and unnecessary.
> Obviously, it does, but why? It used to work without the
> dots for ages.
...[snip]...
> >As best I can tell, your suggestions add up to "SHORT_TOKEN isn't
> >necessary. A few changes to TOKEN can eliminate it." Even if that's
> >not exactly what you're thinking, I've eliminated SHORT_TOKEN without
> >breaking "make check".
>
> Actually, this is exactly what I thought would be possible.
>
> In my version in addition I made TOKENFRONT, TOKENMID and
> TOKENBACK the same. This *does* *change* the meaning, but
> works perfectly for me while reducing complexity.
That opens up a whole can of worms. What characters should be allowed
at the beginning of a token? In the middle? At the back? For example
where should apostrophes be allowed? Bogofilter allows "it's" but
interprets "'tis" as "tis". Allowing apostrophes anywhere makes '''''
a valid token, which I don't want. Does it really matter? To be
honest, I doubt it.
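
To make the front/middle/back question concrete, here is a rough sketch in Python regexes rather than flex. The character classes are made up for illustration and are not the real TOKENFRONT, TOKENMID and TOKENBACK definitions from lexer_v3.l; the names POSITIONAL, UNIFORM and tokens() are likewise hypothetical.

```python
import re

# Hypothetical character classes, NOT the ones in lexer_v3.l.
FRONT = r"[A-Za-z0-9]"    # apostrophe not allowed at the front
MID = r"[A-Za-z0-9']"     # apostrophe allowed in the middle
BACK = r"[A-Za-z0-9]"     # apostrophe not allowed at the back

# Position-sensitive token: one FRONT char, optionally MID* followed by BACK.
POSITIONAL = re.compile(f"{FRONT}(?:{MID}*{BACK})?")

# "Same class everywhere" variant: apostrophes allowed in any position.
UNIFORM = re.compile(r"[A-Za-z0-9']+")


def tokens(pattern, text):
    """Return every non-overlapping match of pattern in text."""
    return pattern.findall(text)


for sample in ["it's", "'tis", "'''''"]:
    print(repr(sample), tokens(POSITIONAL, sample), tokens(UNIFORM, sample))
```

With these assumed classes the positional pattern keeps "it's" whole, reduces "'tis" to "tis", and finds nothing in a run of apostrophes, while the uniform pattern accepts the run of apostrophes as a token.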
...[snip]...
> >
> >File token.c had some special processing to allow 2 character money
> >tokens, i.e. "$1", "$2", etc. The MONEY code allows a cleaner
> >implementation of this special case.
>
> I see. I had removed this clause and don't allow $ in TOKEN
> at all. Maybe it should be retested if the additional
> complexity of currency handling does add any benefit.
Allowing money amounts does matter. Here are my scores for single
digit dollar amounts:
token   spam   good    Fisher
$1      9434   1176  0.719198
$2      4778    626  0.709037
$3      7751    409  0.858166
$4      4543    182  0.888510
$5     19691    524  0.923063
$6      2912    115  0.889920
$7      8135    164  0.940606
$8      3035    118  0.891441
$9      8085    150  0.945080
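
For anyone who wants to see why a separate money rule is needed at all, here is a small sketch in Python. It assumes ordinary tokens shorter than 3 characters are dropped (MIN_TOKEN_LEN below is an assumed value), and the WORD and MONEY patterns are illustrative stand-ins, not the actual lexer_v3.l rules.

```python
import re

MIN_TOKEN_LEN = 3  # assumed minimum length for ordinary tokens

# Illustrative patterns only; the real TOKEN and MONEY rules differ.
WORD = re.compile(r"[A-Za-z0-9]+")
MONEY = re.compile(r"\$[0-9]+(?:\.[0-9]+)?")


def tokenize(text):
    """Try MONEY first at each position, then WORD; skip anything else."""
    out = []
    pos = 0
    while pos < len(text):
        m = MONEY.match(text, pos)
        if m:
            # Money amounts are kept regardless of length, so "$5" survives.
            out.append(m.group())
            pos = m.end()
            continue
        m = WORD.match(text, pos)
        if m:
            # Ordinary words must meet the minimum length.
            if len(m.group()) >= MIN_TOKEN_LEN:
                out.append(m.group())
            pos = m.end()
            continue
        pos += 1  # separator or other character: skip it
    return out


print(tokenize("only $5 or $19.99 ok"))
```

Under those assumptions "or" and "ok" are discarded as too short, while "$5" is kept because the money rule matches it before the length check applies.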
> >I've attached a patch file with the changes from 1.1.1 to current cvs
> >for lexer_v3.l and token.c. If you have further improvements (that
> >don't break "make check"), I'm all ears.
>
> Should be most. The other ones will change the meaning which
> may or may not be beneficial.
'Tis hard to say :-<
> You might want to try `diff -b -B` against my lexer for more
> ideas.
>
> What I do see (no idea when we went different directions) is
> <HTOKEN>({TOKEN}) vs. <HTOKEN>{TOKEN} and the same without
> <HTOKEN>. Does this make any difference? If not I would go
> for the simpler version.
The parens don't have any effect I can detect, and have been removed.
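
That matches what one would expect from regular expressions in general: wrapping a whole pattern in grouping parentheses does not change what it matches. A quick Python check, using a made-up TOKEN pattern rather than the one in lexer_v3.l:

```python
import re

# Hypothetical TOKEN pattern for illustration only.
TOKEN = r"[A-Za-z0-9][A-Za-z0-9'.-]*[A-Za-z0-9]"

plain = re.compile(TOKEN)                # like <HTOKEN>{TOKEN}
wrapped = re.compile("(" + TOKEN + ")")  # like <HTOKEN>({TOKEN})


def matched(pattern, text):
    """Return the full text of every match, ignoring capture groups."""
    return [m.group(0) for m in pattern.finditer(text)]


sample = "free v1agra at example.com today, it's cheap"
print(matched(plain, sample) == matched(wrapped, sample))
```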
Regards,
David