New header token tagging

Thu Sep 25 18:36:54 CEST 2003

relson at osagesoftware.com said:
> Question 1:  Has anybody else noticed an effect from the new header
> tagging?  If so, what have you noticed? 

I don't think I've seen any difference.  I've seen only one FP in the 1906
ham messages I've sent through bogofilter (an FP rate of 0.05%), and my FN
rate has held steady at around 15% (number of FNs divided by the number of
spam messages).

relson at osagesoftware.com said:
> Question 2:  Have y'all a preference for "h:" vs. "head:"? 

I like "head:".  More readable.

relson at osagesoftware.com said:
> Question 3:  Have y'all a preference for what '-H' should do? 

Nope.  However, I think bogofilter should stick to one type of lexer.  
Most people will use the default and supporting many different lexers 
becomes quite tricky.

Have you considered spitting out two tokens for each header word?  I.e. 
for "Organization: Osage Software Systems, Inc.", it sounds like you 
currently produce

head:organization
head:osage
head:software
head:systems
head:inc

(or something like that).  Let me suggest that you also produce

organization
osage
software
systems
inc

From my experience in text classification, it often works better to 
duplicate tokens like this rather than to fragment the feature space by 
adding prefixes like "head:", "to:", etc.
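
To make the suggestion concrete, here is a rough sketch in Python of what I mean.  This is not bogofilter's lexer: the function name header_tokens, the "head:" default, and the simple word-splitting regex are all assumptions for illustration only.

```python
import re


def header_tokens(header_line, prefix="head:"):
    """Return both a prefixed and a bare token for each word in a header line.

    Hypothetical helper: the splitting rule (runs of ASCII letters and
    digits, lowercased) is an assumption, not bogofilter's actual lexer.
    """
    words = re.findall(r"[a-z0-9]+", header_line.lower())
    tokens = []
    for word in words:
        tokens.append(prefix + word)  # header-specific feature
        tokens.append(word)           # same feature a body word would produce
    return tokens


print(header_tokens("Organization: Osage Software Systems, Inc."))
```

With both forms present, a word seen in a header still adds to the counts for the plain word, while the prefixed token keeps whatever header-specific signal there is.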

Jason