Test with different lexers

Tom Anderson tanderso at oac-design.com
Tue Dec 2 15:47:48 CET 2003


On Tue, 2003-12-02 at 08:41, David Relson wrote:
> Bogofilter has different front and back tokens for good reason.  For
> example we don't want digits at the beginning of a token, but they're
> fine at the end, i.e. "12abcd34" parses as "abcd34".  Your description
> indicates that you're parsing it as 8 characters (with all digits) or 4
> characters (no digits).  As a second example, "!" is accepted at the end
> (but not the beginning), reflecting common spammer usage.

I don't see the logic in that.  It seems pretty arbitrary.  

To repeat to the list what I previously said in a private email: The
difference comes from what you consider a "special" character in a
token.  To me, every non-space ASCII character ought to be allowed anywhere in any
token.  Why would we give special consideration to A-Za-z?  I think you
assume too much about what a token _ought_ to consist of, rather than
what it _does_ consist of.  How about "100%"? Or ";-)"?  Or "[sic]"? 
These are important tokens!  I don't think we should be assuming that
tokens must be proper English words.
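
To make the difference concrete, here is a rough Python sketch of the two
approaches.  It is not bogofilter's actual lexer; the patterns and the
function names are assumptions made up for illustration, based only on
the two examples David gave (no leading digits, "!" allowed at the end).

```python
import re

# Assumed "restricted" rule (hypothetical, not bogofilter's real pattern):
# a token starts with a letter, continues with letters or digits,
# and may end with one or more "!" characters.
RESTRICTED = re.compile(r"[A-Za-z][A-Za-z0-9]*!*")

# Assumed "permissive" rule: any run of non-space printable ASCII.
PERMISSIVE = re.compile(r"[\x21-\x7e]+")


def restricted_tokens(text):
    """Return tokens under the restricted front/back character rule."""
    return RESTRICTED.findall(text)


def permissive_tokens(text):
    """Return tokens where every non-space ASCII character is allowed."""
    return PERMISSIVE.findall(text)


if __name__ == "__main__":
    for sample in ["12abcd34", "100% ;-) [sic]", "!free!"]:
        print(sample, restricted_tokens(sample), permissive_tokens(sample))
```

Under these assumed patterns the restricted rule reduces "12abcd34" to
"abcd34" and keeps only "sic" from "100% ;-) [sic]", while the permissive
rule keeps "100%", ";-)" and "[sic]" whole.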

Tom