New header token tagging

Thu Sep 25 18:56:29 CEST 2003

Jason,

Thanks for responding.

On Thu, 25 Sep 2003 12:36:54 -0400
Jason Rennie <jrennie at ai.mit.edu> wrote:

> 
> relson at osagesoftware.com said:
> > Question 1:  Has anybody else noticed an effect from the new header
> > tagging?  If so, what have you noticed? 
> 
> I don't think I've seen any difference.  I've only seen one FP in the
> 1906 ham messages I've sent through bogofilter (an FP rate of 0.05%)
> and my FN rate has held steady around 15% (# FNs divided by number of
> spam).
> 
> relson at osagesoftware.com said:
> > Question 2:  Have y'all a preference for "h:" vs. "head:"? 
> 
> I like "head:".  More readable.

I agree on the readability.  Ideally one never has to see what's
actually happening.

> relson at osagesoftware.com said:
> > Question 3:  Have y'all a preference for what '-H' should do? 
> 
> Nope.  However, I think bogofilter should stick to one type of lexer. 
> 
> Most people will use the default and supporting many different lexers 
> becomes quite tricky.

I agree -- kind of ...  There's a variety of environments in which
bogofilter is used (single user, small (2-10 users), medium (11-99),
large (100 and up)) and there are different ways of training:
everything, train-on-error, unsures, etc.  So there's no single best
answer.  However, the goal is to have good defaults, along with options
for those with special needs.

One can surmise that the number of future users will exceed the number
of current users.  On that basis, bogofilter's defaults should all be
oriented to doing the best it can and flags like '-H' are solely useful
as aids to people upgrading from an old version to a new version.

> Have you considered spitting out two tokens for each header word?  I.e.,
> for "Organization: Osage Software Systems, Inc.", it sounds like you
> currently produce

...[snip]...

Interesting idea, but methinks it's not a good one.  The effect is to
count each of the words twice, which is a bias I find unacceptable.
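
To make the double counting concrete, here is a rough sketch in Python.
It is not bogofilter's lexer (which is written in C); the function names,
the word pattern, and the exact token spelling are assumptions made up
for illustration only.

```python
import re

# Simplified word pattern; bogofilter's real tokenizer rules differ.
WORD = re.compile(r"[a-z0-9]+")

def tagged_tokens(header_line):
    """One token per header word, carrying a "head:" prefix."""
    _, _, value = header_line.partition(":")
    return ["head:" + w for w in WORD.findall(value.lower())]

def duplicated_tokens(header_line):
    """Hypothetical duplication scheme: each header word is emitted
    twice, once bare and once with the "head:" prefix."""
    _, _, value = header_line.partition(":")
    tokens = []
    for w in WORD.findall(value.lower()):
        tokens.append(w)
        tokens.append("head:" + w)
    return tokens

line = "Organization: Osage Software Systems, Inc."
tag = tagged_tokens(line)
dup = duplicated_tokens(line)

# Every header word contributes two tokens under duplication.
print(len(tag), len(dup))
```

With duplication, each header word feeds two tokens into the score
instead of one, which is the bias I'm objecting to.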

> From my experience in text classification, it often works better to 
> duplicate tokens like this rather than to fragment the feature space
> by adding prefixes like "head:", "to:", etc.

"Fragmenting" has the advantage of identifying when "osage" and
"software" are used in different contexts, like To, From, Subject, body,
other header, ...  The value of doing this is supported by a variety of
experiments as well as user reports.  
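
A small sketch of what I mean by context, again in Python and again
only an illustration.  The "head:" and "to:" prefixes come from this
thread; the "from:" and "subj:" spellings, the prefix table, and the
sample message are assumptions, and folded header lines are not handled.

```python
import re

# Simplified word pattern; bogofilter's real tokenizer rules differ.
WORD = re.compile(r"[a-z0-9]+")

# Hypothetical prefix table.  Any header not listed falls back to "head:".
PREFIXES = {"to": "to:", "from": "from:", "subject": "subj:"}

def tokenize(message):
    """Tag header words with a per-field prefix; leave body words bare."""
    head, _, body = message.partition("\n\n")
    tokens = []
    for line in head.splitlines():
        name, _, value = line.partition(":")
        prefix = PREFIXES.get(name.strip().lower(), "head:")
        tokens += [prefix + w for w in WORD.findall(value.lower())]
    tokens += WORD.findall(body.lower())
    return tokens

msg = (
    "From: sales@example.com\n"
    "Subject: Osage software update\n"
    "Organization: Osage Software Systems, Inc.\n"
    "\n"
    "The software ships next week.\n"
)
toks = tokenize(msg)

# "software" in the Subject, in another header, and in the body
# becomes three distinct tokens, each with its own spam/ham counts.
print([t for t in toks if t.endswith("software")])
```

Each of those tokens gets its own statistics, so a word that is
innocent in the body can still count against a message when it shows
up in the Subject.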

Peace,

David