New header token tagging
Jason Rennie
jrennie at ai.mit.edu
Thu Sep 25 18:36:54 CEST 2003
relson at osagesoftware.com said:
> Question 1: Has anybody else noticed an effect from the new header
> tagging? If so, what have you noticed?
I don't think I've seen any difference. I've only seen one FP in the 1906
ham messages I've sent through bogofilter (an FP rate of 0.05%), and my FN
rate has held steady at around 15% (number of FNs divided by number of spam
messages).
relson at osagesoftware.com said:
> Question 2: Have y'all a preference for "h:" vs. "head:"?
I like "head:". More readable.
relson at osagesoftware.com said:
> Question 3: Have y'all a preference for what '-H' should do?
Nope. However, I think bogofilter should stick to one type of lexer.
Most people will use the default, and supporting many different lexers
becomes quite tricky.
Have you considered spitting out two tokens for each header word? I.e.
for "Organization: Osage Software Systems, Inc.", it sounds like you
currently produce
head:organization
head:osage
head:software
head:systems
head:inc
(or something like that). Let me suggest that you also produce
organization
osage
software
systems
inc
From my experience in text classification, it often works better to
duplicate tokens like this rather than to fragment the feature space by
adding prefixes like "head:", "to:", etc.
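
To make the suggestion concrete, here is a minimal sketch in Python (not
bogofilter's actual C lexer). The function name header_tokens, the
word-splitting regex, and the lowercasing are assumptions for illustration
only; the real lexer may split header words differently.

```python
import re

def header_tokens(header_line, prefix="head:"):
    """Return prefixed tokens followed by the same tokens without a prefix.

    Hypothetical helper: splits on runs of letters/digits and lowercases,
    which is only an approximation of what a real mail lexer does.
    """
    words = [w.lower() for w in re.findall(r"[A-Za-z0-9]+", header_line)]
    prefixed = [prefix + w for w in words]
    # Emit both forms so header evidence also counts toward the bare word.
    return prefixed + words

tokens = header_tokens("Organization: Osage Software Systems, Inc.")
print(tokens)
```

With both forms emitted, "osage" in a header still adds to the counts for
"osage" seen in a body, while "head:osage" keeps the header-specific signal.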
Jason