New header token tagging

David Relson relson at osagesoftware.com
Thu Sep 25 18:05:26 CEST 2003


Greetings,

As you all know, header token tagging was expanded in 0.15.4.
Previously, tokens in "To: ", "From: ", "Subject: ", and "Return-Path: "
lines were given the prefixes "to:", "from:", "subj:", and "rtrn:",
respectively.  In addition to that tagging, 0.15.4 tags tokens in
"Received: " lines with "rcvd:" and tags tokens in all other header
lines with "head:".
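
For illustration, the tagging rule described above might be sketched
roughly as follows.  This is a hypothetical Python sketch, not
bogofilter's actual lexer: the function name, the prefix table, and the
whitespace tokenization are all assumptions made for the example.
Return-Path is left out of the table for brevity; it would be one more
entry.

```python
# Hypothetical sketch of 0.15.4-style header tagging; not bogofilter code.
# Return-Path is omitted for brevity and would be one more table entry.
HEADER_PREFIXES = {
    "to": "to:",
    "from": "from:",
    "subject": "subj:",
    "received": "rcvd:",
}


def tag_header_tokens(header_line, prefixes=HEADER_PREFIXES):
    """Prefix each token of one header line according to the header name."""
    # Split "Name: value" at the first colon only.
    name, _, value = header_line.partition(":")
    # Headers without their own prefix fall back to the generic "head:".
    prefix = prefixes.get(name.strip().lower(), "head:")
    # Assumption: tokens are whitespace-separated (the real lexer is richer).
    return [prefix + token for token in value.split()]


print(tag_header_tokens("Subject: cheap pills"))
print(tag_header_tokens("X-Mailer: FooMail"))
```

Before 0.15.4, the fallback case would have left the "X-Mailer: " token
untagged instead of giving it "head:".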

So far, three users have reported how the modified tagging has affected
them:

In my environment all incoming messages are fed into the wordlists (by
'-u' with manual corrections).  My testing indicates a small advantage
to the new tagging.  In actual use, I've not noticed any difference -
either for good or ill.

Michael proposed the changes and reports a 4% improvement in spam
filtering.  He originally proposed using "h:" for the prefix, but I
implemented "head:" for consistency with the other tags, most of which
are 4 letters plus a colon.

Greg uses train-on-error and has seen his false positive rate skyrocket.
It's so bad for him that he has requested a way to turn off the new
tagging.  I've sent him two patches.  The first implements '-H' to turn
off "head:" token tagging and the second implements simple degeneration
lookup, i.e. if "head:whatever" isn't found, look for "whatever".  I
haven't yet heard from him which patch he prefers.
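
The simple degeneration lookup described above could look something
like this.  It is only a sketch under stated assumptions: the wordlist
is modelled as a plain Python dict, and the function name and the
`degenerate` switch are hypothetical, not taken from either patch.

```python
# Hypothetical sketch of "simple degeneration lookup"; not the actual patch.
HEAD_PREFIX = "head:"


def lookup(wordlist, token, degenerate=True):
    """Return the stored value for token, or None if nothing is found."""
    # An exact match always wins.
    if token in wordlist:
        return wordlist[token]
    # If "head:whatever" isn't found, look for plain "whatever".
    if degenerate and token.startswith(HEAD_PREFIX):
        return wordlist.get(token[len(HEAD_PREFIX):])
    return None


# Assumed layout: token -> (spam count, ham count)
wordlist = {"whatever": (3, 1), "head:x-mailer": (5, 0)}
print(lookup(wordlist, "head:whatever"))
print(lookup(wordlist, "head:whatever", degenerate=False))
```

With a wordlist trained before 0.15.4, the fallback lets existing
untagged entries keep contributing until "head:" entries accumulate.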

Question 1:  Has anybody else noticed an effect from the new header
tagging?  If so, what have you noticed?

Question 2:  Have y'all a preference for "h:" vs. "head:"?

Question 3:  Have y'all a preference for what '-H' should do?

Looking forward to hearing from you.

David



