[PATCH] Better tagging.

michael at optusnet.com.au michael at optusnet.com.au
Mon Sep 15 01:49:18 CEST 2003


David Relson <relson at osagesoftware.com> writes:
> On Sun, 14 Sep 2003 23:41:40 +0200
[...] 
> Matthias,
> 
> Yes.  We should confirm that the changes make a difference.  I have a
> test version of lexer_v3.l that can operate identically to current cvs
> or can operate in the new mode and have been looking at what happens. 
> It's not clear that all changes are implemented properly and/or are
> useful.  Here are two examples,
>
> The modified rules include spaces in tokens like "h:Mime-Version: 1.0". 
> Currently tokens can't have spaces, a detail that bogoutil cares about.
>
> Currently 'charset=us-ascii' and 'charset="us-ascii"' both generate
> 'charset' and 'us-ascii'.  With the new rules they generate
> 'h:charset=us-ascii' and 'h:charset="us-ascii"', which is another
> inclusion of an illegal character.

Can we not do a s/[ ]/_/g; or similar to remove the illegal
characters?
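
As a rough illustration of the substitution suggested above, here is a
minimal Python sketch. The function name sanitize_token and the choice
of treating only the space as illegal are assumptions made for this
example; bogofilter's lexer is written in flex/C and would do this
differently.

```python
import re

# Assumption: only the space character is treated as illegal here,
# mirroring the s/[ ]/_/g substitution from the message.
ILLEGAL = re.compile(r"[ ]")


def sanitize_token(token: str) -> str:
    """Replace each illegal character in a token with an underscore."""
    return ILLEGAL.sub("_", token)


print(sanitize_token("h:Mime-Version: 1.0"))  # h:Mime-Version:_1.0
```

Tokens without spaces, such as 'h:charset="us-ascii"', pass through
unchanged, so the quoting distinction is preserved.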

Note that, for me at least, 'h:charset=us-ascii' and
'h:charset="us-ascii"' have different spamicity values:

                       spam    good  Gra prob  Rob/Fis
h:charset=US-ASCII      156    2337  0.066605  0.099997
h:charset="US-ASCII"     32     620  0.052289  0.156971
h:charset="us-ascii"   1284    2833  0.326373  0.331634
h:Charset=US-ASCII        1      12  0.081796  0.399363
h:CHARSET=US-ASCII       14       8  0.651658  0.433847
h:charset=us-ascii     2285    1700  0.589635  0.579120

How about we make '"' a legal character? :)
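
To see what is lost when these variants are collapsed, the sketch below
merges the six rows of the table above under a hypothetical
normalization (fold case, drop double quotes). The normalize and merge
helpers are illustrative names only, not bogofilter functions; the
counts are copied from the table.

```python
from collections import defaultdict

# (token, spam count, good count), copied from the table above.
ROWS = [
    ('h:charset=US-ASCII', 156, 2337),
    ('h:charset="US-ASCII"', 32, 620),
    ('h:charset="us-ascii"', 1284, 2833),
    ('h:Charset=US-ASCII', 1, 12),
    ('h:CHARSET=US-ASCII', 14, 8),
    ('h:charset=us-ascii', 2285, 1700),
]


def normalize(token):
    # Hypothetical normalization: fold case and drop double quotes.
    return token.lower().replace('"', '')


def merge(rows):
    # Sum spam and good counts for tokens that normalize to the same key.
    totals = defaultdict(lambda: [0, 0])
    for token, spam, good in rows:
        key = normalize(token)
        totals[key][0] += spam
        totals[key][1] += good
    return {key: tuple(counts) for key, counts in totals.items()}


merged = merge(ROWS)
print(merged)  # {'h:charset=us-ascii': (3772, 7510)}
```

All six variants fall into a single bucket of 3772 spam and 7510 good
occurrences, so the spread visible in the table between, say,
h:CHARSET=US-ASCII and h:charset=US-ASCII can no longer be expressed.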

> The new rules create 'h:Date' from a 'Date:' statement.  I doubt this is
> useful.

                       spam    good  Gra prob  Rob/Fis
h:DATE                  175      21  0.899074  0.627728
h:Date                24652   26488  0.498721  0.498303

I guess my point is that all these items are hints that bogofilter
currently throws away.  I'm not saying they always make a difference,
but for my data set they definitely do.

Michael.



