Encoded headers parser

Fri Jul 25 06:21:48 CEST 2003

Junior,

Thanks for reporting the problem.  As you probably realize, parsing encoded 
headers is new and the proper lexer pattern is still being worked on.  My 
latest effort (below) handles your test case.  I've also tested it with 
approximately 100 encoded subject lines from my email archives and saw 
nothing wrong (except for the decoding of iso-2022-jp).  Give it a try and 
let me know what problems you encounter.

--- lexer_v3.l  24 Jul 2003 20:48:51 -0000      1.40
+++ lexer_v3.l  25 Jul 2003 03:58:19 -0000
@@ -126,8 +126,7 @@

  TOKEN          {TOKENFRONT}{TOKENMID}{TOKENBACK}{1,70}

-B64_OR_QP      [0-9a-zA-Z\-\+\/\=_:]+
-ENCODED_TOKEN  {BOGOLEX_TOKEN}{0,60}=\?{ID}\?[bq]\?{B64_OR_QP}\?\=
+ENCODED_TOKEN  {BOGOLEX_TOKEN}{0,60}=\?{ID}\?[bq]\?[^[:blank:]]+\?\=

  DOLLARS                [0-9]+
  CENTS          [0-9]+
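
To illustrate what the change does, here is a rough Python sketch (not 
bogofilter code) that compares the old and new patterns for the encoded 
text.  Several things in it are assumptions: the {BOGOLEX_TOKEN}{0,60} 
prefix is left out, the ID regex is only a stand-in because the real {ID} 
definition is not part of the patch, case-insensitive matching is assumed, 
and the sample strings are made up rather than taken from your report.

```python
import re

# Stand-in for the lexer's {ID} (the charset name). The real definition is
# not shown in the patch, so this is an assumption.
ID = r"[^?\s]+"

# Old rule: encoded text limited to the B64_OR_QP character class.
OLD = re.compile(r"=\?" + ID + r"\?[bq]\?[0-9a-zA-Z\-+/=_:]+\?=", re.IGNORECASE)

# New rule: encoded text is any run of characters other than space or tab,
# which is what [^[:blank:]]+ means in the lexer pattern.
NEW = re.compile(r"=\?" + ID + r"\?[bq]\?[^ \t]+\?=", re.IGNORECASE)


def matches(pattern, text):
    """Return True if the whole string is matched as one encoded token."""
    return pattern.fullmatch(text) is not None


# Hypothetical samples: the second has ',' and '!' in the encoded text.
sample_plain = "=?iso-8859-1?Q?Ol=E1_mundo?="
sample_punct = "=?iso-8859-1?Q?Ol=E1,_mundo!?="

for sample in (sample_plain, sample_punct):
    print(sample, "old:", matches(OLD, sample), "new:", matches(NEW, sample))
```

Under those assumptions, the old pattern rejects the second sample because 
',' and '!' are not in B64_OR_QP, while the new pattern accepts anything up 
to the closing ?= as long as it contains no blanks.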

>Another question: will bogofilter ever support double tokens store
>(phrases)? It would improve accuracy, but you don't do it because the
>performance issue?

The current thinking is to get bogofilter doing the best possible job with 
single tokens first.  After that we can work on phrases.