[PATCH] Better tagging.

David Relson relson at osagesoftware.com
Mon Sep 15 02:09:05 CEST 2003


On 15 Sep 2003 09:49:18 +1000
michael at optusnet.com.au wrote:

> David Relson <relson at osagesoftware.com> writes:
> > On Sun, 14 Sep 2003 23:41:40 +0200
> [...] 
> > Matthias,
> > 
> > Yes.  We should confirm that the changes make a difference.  I have
> > a test version of lexer_v3.l that can operate identically to current
> > cvs or can operate in the new mode and have been looking at what
> > happens. It's not clear that all changes are implemented properly
> > and/or are useful.  Here are two examples,
> >
> > The modified rules include spaces in tokens like "h:Mime-Version:
> > 1.0". Currently tokens can't have spaces, a detail that bogoutil
> > cares about.
> >
> > Currently 'charset=us-ascii' and 'charset="us-ascii"' both generate
> > 'charset' and 'us-ascii'.  With the new rules they generate
> > 'h:charset=us-ascii' and 'h:charset="us-ascii"'; the latter again
> > includes an illegal character (the double quote).
> 
> Can we not do a s/[ ]/_/g; or similar to remove the illegal
> characters?

Indeed we can (and likely will).  I find it important to understand
changes and their consequences.  Without testing, I wouldn't have
noticed these details, which I suspect would eventually cause a problem.
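As a concrete illustration, Michael's substitution could look like the
short sketch below (assuming space and '"' are the characters bogoutil
rejects, per the examples above; the names here are hypothetical and
not actual bogofilter code):

```python
import re

# Characters assumed illegal inside a token (space and the double
# quote, per the examples in this thread).
ILLEGAL = re.compile(r'[ "]')

def sanitize_token(token: str) -> str:
    """Replace each illegal character with '_' (the s/[ ]/_/g idea)."""
    return ILLEGAL.sub('_', token)

print(sanitize_token('h:Mime-Version: 1.0'))    # h:Mime-Version:_1.0
print(sanitize_token('h:charset="us-ascii"'))   # h:charset=_us-ascii_
```

Note that mapping both space and quote to '_' merges some distinct
inputs into one token, which is exactly the kind of consequence worth
testing before committing.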
 

> Noting that for me at least, 'h:charset=us-ascii' and
> 'h:charset="us-ascii"' have different spamicity values.
> 
>                        spam    good  Gra prob  Rob/Fis
> h:charset=US-ASCII      156    2337  0.066605  0.099997
> h:charset="US-ASCII"     32     620  0.052289  0.156971
> h:charset="us-ascii"   1284    2833  0.326373  0.331634
> h:Charset=US-ASCII        1      12  0.081796  0.399363
> h:CHARSET=US-ASCII       14       8  0.651658  0.433847
> h:charset=us-ascii     2285    1700  0.589635  0.579120
> 
> How about we make '"' a legal character? :)

Not a good idea.  '"' is currently used in Rtable output.  Allowing it
in tokens would break that.


> > The new rules create 'h:Date' from a 'Date:' statement.  I doubt
> > this is useful.
> 
>                        spam    good  Gra prob  Rob/Fis
> h:DATE                  175      21  0.899074  0.627728
> h:Date                24652   26488  0.498721  0.498303

Evidence is good to have, even if it shows I might be wrong :-(

> I guess my point is that all these items are hints that bogofilter
> currently throws away.  I'm not saying they always make a difference,
> but for my data set they definitely do.

And I'd like to verify their effect on my data, as well.

While on the subject of effect verification, Paul Graham's article
"Better Bayesian Filtering" suggested a variety of changes, most of
which have been added to bogofilter because testing found them useful.
That's why parsing changed from case-insensitive to case-sensitive,
etc.  One idea he proposed, "token degeneration", provides rules for
matching an unknown token to a known one, i.e. given unknown token
"DATE", see if "Date" or "date" is known and use the known value.
Sounds good, eh?  Once implemented, my testing indicated that a fully
case-sensitive wordlist, i.e. one built and used in a case-sensitive
manner, beats any hybrid.  So bogofilter does have "token degeneration"
abilities, but they are disabled by default.

And the moral is:  "Confirm effectiveness of modifications before
committing to them".




More information about the bogofilter-dev mailing list