Tracking metadata and other options (was: token degeneration)

David Relson relson at osagesoftware.com
Tue Jul 29 21:13:09 CEST 2003


Jake,

A well-worded, thoughtful response.  Thank you.  Now to fill you in on some
of the background...

Originally, bogofilter was case-insensitive.  "FREE", "Free", and "free"
were all represented in the database by the single token "free".  Then
along came Paul Graham's latest research report with the finding that being
case-sensitive provides better results.  We verified that finding and
changed bogofilter's defaults.

The ideal thing to do at that point was to take all your accumulated ham and
spam messages and rebuild the wordlists using the revised, case-sensitive
bogofilter.  Some people, ISPs for example, can't do this because they
don't have copies of the messages.  That led to the request to implement
Graham's degeneration technique.

The upside of the technique is that new tokens that differ from known
tokens only in capitalization (or prefixes or exclamation points) will be
recognized.  The downside is that the extra lookups take time.  My belief
is that, given a case-sensitive wordlist, degeneration is unnecessary.
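
To make the idea concrete, here is a minimal sketch of a degeneration
lookup in C.  It is not bogofilter's code and it is simpler than Graham's
scheme: it handles only capitalization and trailing exclamation points,
not prefixes.  The names degenerate_lookup, make_variant, toy_known and
MAX_TOKEN are made up for illustration, and the toy wordlist stands in for
a real database lookup.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define MAX_TOKEN 64   /* illustrative buffer size, not a bogofilter limit */

/* Hypothetical lookup callback: nonzero if the token is in the wordlist. */
typedef int (*lookup_fn)(const char *token);

/* Toy wordlist for illustration only: knows "free" and "Viagra". */
static int toy_known(const char *token)
{
    return strcmp(token, "free") == 0 || strcmp(token, "Viagra") == 0;
}

/* Copy src into dst (MAX_TOKEN bytes), optionally dropping trailing '!'
 * characters and/or lowercasing.  Over-long tokens are truncated. */
static void make_variant(char *dst, const char *src, int strip_bang, int lower)
{
    size_t len = strlen(src);
    size_t i;

    if (len >= MAX_TOKEN)
        len = MAX_TOKEN - 1;
    if (strip_bang)
        while (len > 0 && src[len - 1] == '!')
            len--;
    for (i = 0; i < len; i++)
        dst[i] = lower ? (char)tolower((unsigned char)src[i]) : src[i];
    dst[len] = '\0';
}

/* Try the exact token first, then progressively less specific forms:
 * lowercased, '!' stripped, then both.  Returns 1 and leaves the matching
 * form in out (MAX_TOKEN bytes); returns 0 with out set to "" otherwise. */
static int degenerate_lookup(const char *token, lookup_fn known, char *out)
{
    int step;

    for (step = 0; step < 4; step++) {
        /* bit 0 of step: lowercase; bit 1 of step: strip trailing '!' */
        make_variant(out, token, (step & 2) != 0, (step & 1) != 0);
        if (out[0] != '\0' && known(out))
            return 1;
    }
    out[0] = '\0';
    return 0;
}
```

With the toy wordlist, "FREE!!!" falls back to "free" and "Viagra!" falls
back to "Viagra", while "VIAGRA" finds nothing because only the mixed-case
form is known.  An unknown token costs up to four lookups here instead of
one, which is the time penalty mentioned above.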

Anyhow, beliefs aside, the ability is now in bogofilter and will be in the 
next release.  If testing shows the technique is truly useful, it will 
become the default mode.

Enough about degeneration...

Your second subject was command line switches and config file options.

Comparing bogofilter's help message with the file bogofilter.cf.example
shows that many of the settable parameters can be set from both the command
line and the config file.  For testing, command line switches are indeed
very useful, though I tend to prefer the config file.  Greg, on the other
hand, prefers the command line.

Presently, the command line is parsed using getopt(), which limits switches
to single characters.  At some point we might switch to getopt_long() for
support of "--longoptions".  Time will tell :-)

David




