Test with different lexers

Tue Dec 2 15:01:52 CET 2003

David Relson wrote:

>> Over the time we have introduced several special rules to
>> deal with specific problematic messages. My version has
>> removed some of those (different token front and back,
>> dollar rule, no short tokens, no numeric tokens, doctype
>> switch, maybe more).
> 
> This description concerns me.  Some of the removed rules have only been
> in your private version of bogofilter. 

This must be a misunderstanding. Maybe I did not write
clearly enough.

> Two examples are short tokens and numeric tokens. 

Right, they are excluded in the standard lexer.

> Your earlier tests found them useful, though my tests didn't.

This is the point Tom made. We impose restrictions on the
lexer based on certain ideas. Ideas like "tokens of length
one or two are not helpful", "tokens starting with a digit
are not helpful", "tokens with an ! at the end are useful".
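
To make these a priori restrictions concrete, here is a minimal
sketch in Python. It is not bogofilter's lexer (which is generated
from flex rules); the function name keep_token and its parameters
are made up for illustration only.

```python
def keep_token(token, min_len=3, allow_leading_digit=False):
    """Hypothetical filter mimicking two of the restrictions above."""
    # Rule: tokens of length one or two are assumed unhelpful.
    if len(token) < min_len:
        return False
    # Rule: tokens starting with a digit are assumed unhelpful.
    if not allow_leading_digit and token[:1].isdigit():
        return False
    return True


sample = ["a", "hi", "free", "2003", "42nd", "money"]

# With both restrictions switched on, only two tokens survive.
restricted = [t for t in sample if keep_token(t)]

# With both restrictions switched off, every token reaches the statistics.
unrestricted = [t for t in sample
                if keep_token(t, min_len=1, allow_leading_digit=True)]

print(restricted)
print(unrestricted)
```

Whether the dropped tokens carry any signal is exactly what the
tests are meant to find out; the sketch only shows what gets
thrown away before the statistics ever see it.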

It is also true that in several tests I posted to the list,
the single changes I made were indifferent or maybe even a
little worse in testing. But combined they work for me.
Whether they work for others I cannot say, but I don't see
why they should fail.

> Bogofilter has different front and back tokens for good reason.  For
> example we don't want digits at the beginning of a token, but they're
> fine at the end, i.e. "12abcd34" parses as "abcd34".  

I question this. I fail to see a theoretical argument why
one would be fine but not the other. If it works, well,
that might well depend on the individual mail collection.
For me they work, for you I don't know.

> Your description
> indicates that you're parsing it as 8 characters (with all digits) or 4
> characters (no digits). 

Sorry if I was not clear. Your example token would be read
in full, so I get "12abcd34" and not the standard "abcd34".
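
The difference between the two readings can be sketched with two
regular expressions. These patterns are assumptions chosen to
reproduce the example, not the actual flex rules of either lexer,
and the names STRICT, RELAXED and tokens are hypothetical.

```python
import re

# Hypothetical "standard" style: the front of a token must be a letter,
# while digits are accepted further back.
STRICT = re.compile(r"[A-Za-z][A-Za-z0-9]*")

# Hypothetical "relaxed" style: the same character class front and back.
RELAXED = re.compile(r"[A-Za-z0-9]+")


def tokens(pattern, text):
    """Return all non-overlapping matches of pattern in text."""
    return pattern.findall(text)


print(tokens(STRICT, "12abcd34"))   # ['abcd34']
print(tokens(RELAXED, "12abcd34"))  # ['12abcd34']
```

With the relaxed pattern the leading digits stay attached, so the
classifier sees one token where the strict pattern sees a shorter,
different one.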

> As a second example, "!" is accepted at the end
> (but not the beginning), reflecting common spammer usage.

This is a nice example of an idea which sounds totally
reasonable. In my test (which I did post), though, it was
actually indifferent: in some tests it worked better, in
others worse. With rules like this we try to code some
actual technique we see as humans into bogofilter, so we
want to be more clever than the statistics. It might well
work out in some cases, it might also surprise us or change
nothing in effect.

We also have no understanding of how different rules play
together. Do they remain useful if combined? Could be, maybe
not. So this test was designed to get as many of those a
priori judgements out as seemed reasonable to me (others
might go even further or not quite as far). My result is
that we can just as well leave those out.

pi