Encoded headers parser
David Relson
relson at osagesoftware.com
Fri Jul 25 06:21:48 CEST 2003
Junior,
Thanks for reporting the problem. As you probably realize, parsing encoded
headers is new and the proper lexer pattern is still being worked on. My
latest effort (below) handles your test case. I've also tested it with
approximately 100 encoded subject lines from my email archives and saw no
problems (except for the decoding of iso-2022-jp). Give it a
try and let me know what problems you encounter.
--- lexer_v3.l 24 Jul 2003 20:48:51 -0000 1.40
+++ lexer_v3.l 25 Jul 2003 03:58:19 -0000
@@ -126,8 +126,7 @@
TOKEN {TOKENFRONT}{TOKENMID}{TOKENBACK}{1,70}
-B64_OR_QP [0-9a-zA-Z\-\+\/\=_:]+
-ENCODED_TOKEN {BOGOLEX_TOKEN}{0,60}=\?{ID}\?[bq]\?{B64_OR_QP}\?\=
+ENCODED_TOKEN {BOGOLEX_TOKEN}{0,60}=\?{ID}\?[bq]\?[^[:blank:]]+\?\=
DOLLARS [0-9]+
CENTS [0-9]+
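
To see what the change buys, here is a rough Python sketch comparing the old and new body patterns on a few encoded words. It is only an approximation of the flex rules: the definition of {ID} is not shown in the diff, so the ID expression below is an assumed stand-in, the optional {BOGOLEX_TOKEN}{0,60} prefix is left out, and [:blank:] is taken to mean space or tab. The sample strings are made up for illustration.

```python
import re

# Assumption: {ID} (the charset name) is not shown in the diff; this is a stand-in.
ID = r"[0-9a-zA-Z_.\-]+"

OLD_BODY = r"[0-9a-zA-Z\-\+\/\=_:]+"  # B64_OR_QP as in revision 1.40
NEW_BODY = r"[^ \t]+"                  # [^[:blank:]]+ ; blank = space or tab


def encoded_word(body):
    """Build an approximate ENCODED_TOKEN pattern (without the token prefix)."""
    return re.compile(r"=\?" + ID + r"\?[bq]\?" + body + r"\?=")


OLD = encoded_word(OLD_BODY)
NEW = encoded_word(NEW_BODY)


def matches(pattern, text):
    """True when the whole string is a single encoded word under this pattern."""
    return pattern.fullmatch(text) is not None


# Hypothetical sample headers, not taken from the original report.
samples = [
    "=?utf-8?b?SGVsbG8=?=",            # plain base64 body
    "=?iso-8859-1?q?Ol=E1,_mundo!?=",  # QP body with ',' and '!'
    "=?utf-8?q?two words?=",           # contains a blank
]

for s in samples:
    print(s, "old:", matches(OLD, s), "new:", matches(NEW, s))
```

The old character class rejects a quoted-printable body as soon as it contains punctuation such as ',' or '!', while the new pattern accepts any run of non-blank characters up to the closing ?=.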
>Another question: will bogofilter ever support storing double tokens
>(phrases)? It would improve accuracy, but is it left out because of the
>performance issue?
Current thought is to get bogofilter doing the best possible job with
single tokens. After that we can work on phrases.