New header token tagging
David Relson
relson at osagesoftware.com
Thu Sep 25 18:05:26 CEST 2003
Greetings,
As you all know, in 0.15.4 header token tagging has been expanded.
Previously, tokens in "To: ", "From: ", "Subject: ", and "Return-Path: "
lines were given the prefixes "to:", "from:", "subj:", and "rtrn:"
respectively. In addition to that tagging, 0.15.4 tags tokens in
"Received: " lines with "rcvd:", and tokens in all other header lines
are tagged "head:".
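
To make the tagging scheme concrete, here is a minimal Python sketch. It is
not bogofilter's actual lexer (which is written in C); the function name
tag_header_line, the token regex, and the lowercasing are assumptions made
for illustration only. The prefix table follows the description above, with
"rtrn:" assumed for Return-Path.

```python
import re

# Prefixes per header name (keys are lowercased header names).
# "rtrn:" for Return-Path is an assumption of this sketch.
HEADER_TAGS = {
    "to": "to:",
    "from": "from:",
    "subject": "subj:",
    "return-path": "rtrn:",
    "received": "rcvd:",
}
DEFAULT_TAG = "head:"  # every other header line, as of 0.15.4

# Hypothetical token pattern; the real lexer's rules differ.
TOKEN_RE = re.compile(r"[A-Za-z0-9$'.@_-]+")


def tag_header_line(line):
    """Split one header line into tokens and prefix each with its tag."""
    name, _, value = line.partition(":")
    tag = HEADER_TAGS.get(name.strip().lower(), DEFAULT_TAG)
    return [tag + tok.lower() for tok in TOKEN_RE.findall(value)]


if __name__ == "__main__":
    print(tag_header_line("Subject: Cheap pills"))
    print(tag_header_line("X-Mailer: Foo 1.0"))
```

With this sketch, a token such as "foo" in an "X-Mailer: " line becomes
"head:foo", so it no longer matches a plain "foo" already in the wordlist.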
So far, three users have reported on the effect of the modified tagging:
In my environment all incoming messages are fed into the wordlists (by
'-u' with manual corrections). My testing indicates a small advantage
to the new tagging. In actual use, I've not noticed any difference -
either for good or ill.
Michael proposed the changes and reports a 4% improvement in spam
filtering. He originally proposed using "h:" for the prefix, but I
implemented "head:" for consistency with the other tags, most of which
are 4 letters plus a colon.
Greg uses train-on-error and has seen his false positive rate skyrocket.
It's so bad for him that he has requested a way to turn off the new
tagging. I've sent him two patches. The first implements '-H' to turn
off "head:" token tagging and the second implements simple degeneration
lookup, i.e. if "head:whatever" isn't found, look for "whatever". I
haven't yet heard from him which patch he prefers.
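
The degeneration lookup described above can be sketched roughly as follows.
This is an illustration in Python, not the patch itself; the wordlist is
modelled as a plain dict mapping a token to hypothetical (spam, good)
counts, and the name lookup and its degenerate parameter are made up for
the example.

```python
HEAD_TAG = "head:"


def lookup(wordlist, token, degenerate=True):
    """Return the counts stored for token, or None if it is unknown.

    If a "head:"-tagged token is missing and degenerate is true,
    retry with the tag stripped, i.e. "head:whatever" -> "whatever".
    """
    if token in wordlist:
        return wordlist[token]
    if degenerate and token.startswith(HEAD_TAG):
        return wordlist.get(token[len(HEAD_TAG):])
    return None


if __name__ == "__main__":
    # Hypothetical wordlist built before 0.15.4, so "whatever" is untagged.
    wordlist = {"whatever": (3, 1), "head:known": (5, 0)}
    print(lookup(wordlist, "head:whatever"))
    print(lookup(wordlist, "head:missing"))
```

The tagged entry always wins when it exists; the untagged form is consulted
only as a fallback, and only for "head:" tokens. Turning tagging off
entirely, as the '-H' patch does, would instead mean never generating the
"head:" prefix in the first place.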
Question 1: Has anybody else noticed an effect from the new header
tagging? If so, what have you noticed?
Question 2: Have y'all a preference for "h:" vs. "head:"?
Question 3: Have y'all a preference for what '-H' should do?
Looking forward to hearing from you.
David