Test with different lexers

Tue Dec 2 14:41:31 CET 2003

On Tue, 02 Dec 2003 14:19:23 +0100
Boris 'pi' Piwinger <3.14 at logic.univie.ac.at> wrote:

> Hi!
> 
> I have done another test with bogofilter's new lexer and my
> version (http://piology.org/bogofilter/lexer_v3.l):

...[snip]...

> Over the time we have introduced several special rules to
> deal with specific problematic messages. My version has
> removed some of those (different token front and back,
> dollar rule, no short tokens, no numeric tokens, doctype
> switch, maybe more).

pi,

This description concerns me.  Some of the removed rules have only been
in your private version of bogofilter.  They have never been in a
released version.  Two examples are short tokens and numeric tokens. 
Your earlier tests found them useful, though my tests didn't.

Bogofilter has different front and back tokens for good reason.  For
example, we don't want digits at the beginning of a token, but they're
fine at the end, e.g. "12abcd34" parses as "abcd34".  Your description
indicates that you're parsing it either as 8 characters (keeping all
the digits) or as 4 characters (dropping all the digits).  As a second
example, "!" is accepted at the end of a token (but not at the
beginning), reflecting common spammer usage.
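
To make the front/back distinction concrete, here is a rough sketch in
Python.  It is not bogofilter's lexer (that is written in flex); the
character classes FRONT, MID and BACK and the function tokens() are
made up for illustration and only cover the two examples above.

```python
import re

# Hypothetical character classes, NOT bogofilter's real lexer rules.
FRONT = r"[A-Za-z]"      # a token may not start with a digit or "!"
MID = r"[A-Za-z0-9]"     # digits are fine inside a token
BACK = r"[A-Za-z0-9!]"   # digits and "!" are accepted at the end

# One front character, optionally followed by middle characters and
# exactly one back character.
TOKEN = re.compile(f"{FRONT}(?:{MID}*{BACK})?")

def tokens(text):
    """Return the tokens found in text under the assumed rules."""
    return TOKEN.findall(text)

print(tokens("12abcd34"))   # leading digits dropped, trailing digits kept
print(tokens("!free!"))     # leading "!" dropped, trailing "!" kept
```

With a single character class for both ends, "12abcd34" could only come
out with all of its digits or with none of them, which is the behaviour
I'm concerned about.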

David