[PATCH] Better tagging.

Sun Sep 14 23:57:42 CEST 2003

On Sun, 14 Sep 2003 23:41:40 +0200
Matthias Andree <matthias.andree at gmx.de> wrote:

> michael at optusnet.com.au writes:
> 
> > This improves the performance of bogofilter by about 4% on
> > my dataset. I.e. the number of false negatives is
> > approx 14% before this patch, and approx 10% after. False
> > positives aren't changed. (they _may_ be better, but the
> > numbers are too small to be reliable).
> >
> > This is pretty much what I'd expect. The more info
> > you feed it, the better it is at discrimination.
> 
> Well, the patch looks very useful to me, since we can now investigate
> closer what tokens are good indicators and we separate header from
> body information. David, any objections to merging the stuff in one go
> (save for polishing it)?

Matthias,

Yes.  We should confirm that the changes make a difference.  I have a
test version of lexer_v3.l that can operate identically to current cvs
or can operate in the new mode and have been looking at what happens. 
It's not clear that all changes are implemented properly and/or are
useful.  Here are two examples:

The modified rules include spaces in tokens like "h:Mime-Version: 1.0".
Currently tokens can't have spaces, a detail that bogoutil cares about.
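
To show why a space inside a token is a problem, here is a rough sketch
in Python.  It is not bogofilter or bogoutil code: the three-field
record layout and the names dump_line, parse_line and is_legal_token
are assumptions made up for the illustration.  It only shows that a
whitespace-separated record stops being parseable once the token
itself contains whitespace.

```python
def dump_line(token, spam, good):
    # Hypothetical record layout: token, spam count, good count,
    # separated by single spaces.
    return f"{token} {spam} {good}"


def parse_line(line):
    # A simple reader splits on whitespace and expects exactly three fields.
    fields = line.split()
    if len(fields) != 3:
        raise ValueError(f"expected 3 fields, got {len(fields)}: {line!r}")
    token, spam, good = fields
    return token, int(spam), int(good)


def is_legal_token(token):
    # A token is treated as legal here only if it has no whitespace at all.
    return not any(ch.isspace() for ch in token)


if __name__ == "__main__":
    for tok in ("h:Date", "h:Mime-Version: 1.0"):
        line = dump_line(tok, 3, 1)
        try:
            print(parse_line(line))
        except ValueError as err:
            print("cannot read back:", err)
```

A check like is_legal_token could be run over a lexer's output to catch
such tokens before they reach the word list.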

Currently 'charset=us-ascii' and 'charset="us-ascii"' both generate
'charset' and 'us-ascii'.  With the new rules they generate
'h:charset=us-ascii' and 'h:charset="us-ascii"', which is another
inclusion of an illegal character.
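
The difference can be sketched in a few lines of Python.  This is only
an approximation of the two behaviours described above, not the
lexer_v3.l rules themselves; the delimiter set and the function names
tokens_current and tokens_tagged are assumptions for the example.

```python
import re

# Assumed delimiter set for this illustration: whitespace, '=', '"' and ';'.
DELIMS = re.compile(r'[\s=";]+')


def tokens_current(text):
    # Approximates the current behaviour: split on the delimiters and
    # drop the empty strings that splitting leaves behind.
    return [t for t in DELIMS.split(text) if t]


def tokens_tagged(text):
    # Approximates the new behaviour as described: each whitespace-separated
    # chunk becomes one token with an "h:" prefix, punctuation included.
    return ["h:" + chunk for chunk in text.split()]


if __name__ == "__main__":
    for param in ('charset=us-ascii', 'charset="us-ascii"'):
        print(param, tokens_current(param), tokens_tagged(param))
```

Under these assumptions both spellings give the same two tokens with
the current splitting, while the tagged variant yields two distinct
tokens that still carry '=' and '"'.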

The new rules create 'h:Date' from a 'Date:' statement.  I doubt this is
useful.

As I said above, I want to run a real test before accepting them.

Michael also reported a bug (or was it two?) and I've attended to that.

David