Tracking metadata and other options (was: token degeneration)
David Relson
relson at osagesoftware.com
Tue Jul 29 21:13:09 CEST 2003
Jake,
A well-worded, thoughtful response. Thank you. Now to fill you in on some
of the background...
Originally bogofilter was case-insensitive: "FREE", "Free", and "free"
were all represented in the database by the single token "free". Then
along came Paul Graham's latest research report with the finding that being
case-sensitive gives better results. We verified that finding and
changed bogofilter's defaults.
The ideal thing to do at that point was to take all your accumulated ham and
spam messages and rebuild the wordlists using the revised, case-sensitive
bogofilter. Some people, ISPs for example, can't do this because they
don't have copies of the messages. That led to the request to
implement Graham's degeneration technique.
The upside of the technique is that new tokens that differ from known ones only in
capitalization (or prefixes or exclamation points) will be recognized. The
downside is that the extra lookups take time. My belief is that, given a case-sensitive
wordlist, degeneration is unnecessary.
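
For anyone who hasn't seen the technique, here is a rough Python sketch of
the idea. It is not bogofilter's actual code. The "*" context marker (as in
"Subject*FREE!!!"), the function names, and the rule of picking the variant
whose score is farthest from 0.5 are assumptions, based on my reading of
Graham's description. The wordlist is just a dict of token scores.

```python
def degenerate(token):
    """Yield less specific variants of token, most specific first."""
    # Assumed convention: "Subject*FREE!!!" = context prefix + "*" + word.
    prefix, sep, word = token.rpartition("*")
    prefixes = [prefix + sep, ""] if sep else [""]

    core = word.rstrip("!")
    bangs = len(word) - len(core)
    # Exclamation variants: all of them, just one, none (duplicates dropped).
    suffixes = list(dict.fromkeys(["!" * bangs, "!" if bangs else "", ""]))

    # Case variants only move toward lowercase: FREE -> Free -> free.
    cases = [core]
    if core.isupper():
        cases.append(core.capitalize())
    cases.append(core.lower())
    cases = list(dict.fromkeys(cases))

    seen = {token}
    for p in prefixes:
        for s in suffixes:
            for c in cases:
                variant = p + c + s
                if variant not in seen:
                    seen.add(variant)
                    yield variant


def lookup(token, wordlist):
    """Return the score for token, degenerating only when it is unknown."""
    if token in wordlist:
        return wordlist[token]
    found = [wordlist[v] for v in degenerate(token) if v in wordlist]
    if not found:
        return None
    # Assumed rule: keep the most extreme score among the known variants.
    return max(found, key=lambda score: abs(score - 0.5))
```

Note that every unknown token costs up to a dozen or more additional
lookups, which is where the time goes.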
Anyhow, beliefs aside, the ability is now in bogofilter and will be in the
next release. If testing shows the technique is truly useful, it will
become the default mode.
Enough about degeneration...
Your second subject was command line switches and config file options.
Comparing bogofilter's help message and file bogofilter.cf.example shows
that many of the settable parameters can be set from both the command line
and the config file. For testing, command line switches are indeed very
useful, though I tend to prefer the config file. On the other hand, Greg
prefers the command line.
Presently, the command line is parsed using getopt(), which limits switches
to single characters. At some point we might switch to getopt_long() for
support of "--longoptions". Time will tell :-)
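
To illustrate the difference, here is a small sketch using Python's standard
getopt module, which is modeled on the C functions. The option letters and
long names are made up for the example and are not a proposal for
bogofilter's actual switches.

```python
import getopt


def parse_short(argv):
    """Single-character switches only, in the style of C getopt()."""
    # "vc:" means: -v takes no argument, -c takes one.
    opts, rest = getopt.getopt(argv, "vc:")
    return dict(opts), rest


def parse_long(argv):
    """Same switches plus long names, in the style of getopt_long()."""
    # A trailing "=" marks a long option that takes an argument.
    opts, rest = getopt.getopt(argv, "vc:", ["verbose", "config-file="])
    return dict(opts), rest
```

With the first parser, "--verbose" is rejected with getopt.GetoptError. The
second accepts both "-c my.cf" and "--config-file=my.cf".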
David