[PATCH] Better tagging.
michael at optusnet.com.au
Mon Sep 15 01:49:18 CEST 2003
David Relson <relson at osagesoftware.com> writes:
> On Sun, 14 Sep 2003 23:41:40 +0200
[...]
> Matthias,
>
> Yes. We should confirm that the changes make a difference. I have a
> test version of lexer_v3.l that can operate identically to current cvs
> or can operate in the new mode and have been looking at what happens.
> It's not clear that all changes are implemented properly and/or are
> useful. Here are two examples,
>
> The modified rules include spaces in tokens like "h:Mime-Version: 1.0".
> Currently tokens can't have spaces, a detail that bogoutil cares about.
>
> Currently 'charset=us-ascii' and 'charset="us-ascii"' both generate
> 'charset' and 'us-ascii'. With the new rules they generate
> 'h:charset=us-ascii' and 'h:charset="us-ascii"', which is another
> inclusion of an illegal character.
Can we not do a s/[ ]/_/g; or similar to remove the illegal
characters?
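
To make the substitution idea concrete, here is a minimal sketch in Python rather than in the lexer itself. The name sanitize_token and the ILLEGAL pattern are hypothetical, chosen only for illustration; the set of illegal characters is assumed to be just the space, which is the only one the s/[ ]/_/g expression covers.

```python
import re

# Assumption: the space is the only illegal character, as in s/[ ]/_/g.
# Extend the character class if other characters must also be excluded.
ILLEGAL = re.compile(r"[ ]")


def sanitize_token(token: str, replacement: str = "_") -> str:
    """Return the token with every illegal character replaced."""
    return ILLEGAL.sub(replacement, token)


if __name__ == "__main__":
    print(sanitize_token("h:Mime-Version: 1.0"))  # h:Mime-Version:_1.0
```

One caveat with this approach: the mapping is not reversible, so a token that already contains '_' can collide with a token that originally contained a space in the same position.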
Note that, for me at least, 'h:charset=us-ascii' and
'h:charset="us-ascii"' have different spamicity values:
                        spam   good  Gra prob   Rob/Fis
h:charset=US-ASCII       156   2337  0.066605  0.099997
h:charset="US-ASCII"      32    620  0.052289  0.156971
h:charset="us-ascii"    1284   2833  0.326373  0.331634
h:Charset=US-ASCII         1     12  0.081796  0.399363
h:CHARSET=US-ASCII        14      8  0.651658  0.433847
h:charset=us-ascii      2285   1700  0.589635  0.579120
How about we make '"' a legal character? :)
> The new rules create 'h:Date' from a 'Date:' statement. I doubt this is
> useful.
           spam   good  Gra prob   Rob/Fis
h:DATE      175     21  0.899074  0.627728
h:Date    24652  26488  0.498721  0.498303
I guess my point is that all these items are hints that bogofilter
currently throws away. I'm not saying they always make a difference,
but for my data set they definitely do.
Michael.