parsing options

David Relson relson at osagesoftware.com
Fri May 16 01:02:14 CEST 2003


At 12:08 PM 5/15/03, Dave Lovelace wrote:

>David Relson wrote, in part:
> >
> > It would be nice to have the historical defaults correspond to either
> > all upper case or all lower case.
> >
>
>Well, that raises another issue for us here.  For us, bogofilter has
>been working very well, though of course improvement would be welcome.
>(I haven't been doing the update-of-the-day or anything; we're still at
>0.11.1.8.)
>
>You keep saying that turning off case folding improves
>accuracy (and I'm not questioning that).  But we have fairly large
>databases (well, not by some people's standards) created by versions
>that did case folding.  I have a strong suspicion that updating to a
>current rev & turning off case folding will be a disaster in terms of
>accuracy, short term, as the lower/mixed case tokens are not in either
>database.  And the mail that generated those databases is
>by & large one with the snows of yesteryear.
>
>Is any improvement from turning off case folding going to be worth the
>hassle of retraining bogofilter over several weeks?

Dave,

I don't think you need to worry about retraining.  Think of case 
sensitivity as expanding the set of words that bogofilter knows.  Most 
words of most messages are lower case and bogofilter will process those 
words as before.  The remaining words contain capital letters and are not
presently in your wordlists.  If you use autoupdating (the '-u' option),
bogofilter will quickly come to associate those new words with spam (or 
ham), depending on the context they were received in.  If you only update 
the wordlists for false negatives and false positives, learning the new 
words will take longer.
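
As a rough illustration of the point above, here is a toy Python sketch.
It is not bogofilter's actual lexer or wordlist format; the function names
and the sample wordlist are made up for the example.  It only shows that,
with case folding off, the lower case tokens still match a wordlist built
by a case-folding version, and only the tokens containing capitals are new:

```python
# Illustrative only: NOT bogofilter's real tokenizer or database format.
import re

def tokenize(text, fold_case):
    """Split text into word tokens, optionally lower-casing them."""
    words = re.findall(r"[A-Za-z']+", text)
    return [w.lower() if fold_case else w for w in words]

# Hypothetical wordlist built while case folding was on,
# so every key is lower case.
old_wordlist = {"free", "money", "now", "meeting", "at", "noon"}

def split_known_unknown(text, wordlist):
    """Tokenize WITHOUT case folding; report which tokens the wordlist has."""
    tokens = tokenize(text, fold_case=False)
    known = [t for t in tokens if t in wordlist]
    unknown = [t for t in tokens if t not in wordlist]
    return known, unknown

known, unknown = split_known_unknown("FREE money now", old_wordlist)
print(known)    # ['money', 'now']  (still matched, as before)
print(unknown)  # ['FREE']          (new token, to be learned over time)
```

The unknown tokens are the ones that autoupdating, or corrections for
false negatives and false positives, would gradually add to the wordlists.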

On the other hand, you can continue to do case folding if you so 
desire.  Bogofilter will continue to have the ability to enable or disable 
that.  The default will likely change to enable mixed case, but there will be
a command line switch and a config file option to set that as you want it.

David





