Test with different lexers
Tom Anderson
tanderso at oac-design.com
Tue Dec 2 15:47:48 CET 2003
On Tue, 2003-12-02 at 08:41, David Relson wrote:
> Bogofilter has different front and back tokens for good reason. For
> example we don't want digits at the beginning of a token, but they're
> fine at the end, i.e. "12abcd34" parses as "abcd34". Your description
> indicates that you're parsing it as 8 characters (with all digits) or 4
> characters (no digits). As a second example, "!" is accepted at the end
> (but not the beginning), reflecting common spammer usage.
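To make the quoted rule concrete, here is a rough sketch in Python of what an asymmetric front/back rule could look like. The pattern and the name asymmetric_tokens are my own assumptions for illustration; this is not Bogofilter's actual lexer, only a toy that reproduces the two examples given above.

```python
import re

# Hypothetical pattern, NOT Bogofilter's real lexer rules:
#   front:  a letter only (no digits, no "!")
#   middle: letters or digits
#   back:   an optional trailing "!"
ASYMMETRIC_TOKEN = re.compile(r"[A-Za-z][A-Za-z0-9]*!?")

def asymmetric_tokens(text):
    """Return every substring that satisfies the asymmetric rule."""
    return ASYMMETRIC_TOKEN.findall(text)
```

Under these assumptions, leading digits are dropped while trailing digits survive, and a string with no letters at all, such as "100%", yields no token.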
I don't see the logic in that. It seems pretty arbitrary.
To repeat to the list what I previously said in a private email: The
difference comes from what you consider a "special" character in a
token. To me, every non-space ASCII character ought to be allowed anywhere in any
token. Why would we give special consideration to A-Za-z? I think you
assume too much about what a token _ought_ to consist of, rather than
what it _does_ consist of. How about "100%"? Or ";-)"? Or "[sic]"?
These are important tokens! I don't think we should be assuming that
tokens must be proper English words.
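As a minimal sketch of what I mean, assuming a token is simply any run of printable, non-space ASCII characters (the function name and the exact character range are illustrative choices, not anything taken from Bogofilter):

```python
import re

# Assumption for illustration: a token is any maximal run of printable,
# non-space ASCII characters ("!" through "~").
NONSPACE_ASCII_TOKEN = re.compile(r"[!-~]+")

def nonspace_tokens(text):
    """Split text into runs of printable non-space ASCII characters."""
    return NONSPACE_ASCII_TOKEN.findall(text)
```

With this rule "100%", ";-)" and "[sic]" each come through as a single token, and "12abcd34" stays whole. The cost is that trailing punctuation sticks to words, so "spam." and "spam" are counted as different tokens.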
Tom