Re casefolding

David Relson relson at osagesoftware.com
Thu May 15 03:23:56 CEST 2003


At 07:34 PM 5/14/03, michael at optusnet.com.au wrote:

>David Relson <relson at osagesoftware.com> writes:
> >
> > Michael,
> >
> > Since 0.12.3, a group of parsing options have been added to bogofilter
> > and bogolexer.  They're all toggles that enable/disable capabilities,
> > i.e.
> >
> > "-Pf" for case-folding
> > "-Ph" for tagging of header lines
> > "-Pt" for tokenizing of html tags
> > "-PC" for strict checking (of html comments)
> >
> > If you update your source code, you'll be able to test case folding to
> > your heart's content!
>
>Excellent. :)

Greg and I have been discussing and testing the new parsing options 
today.  Rather than have the parsing switches toggle bogofilter's default 
state, we think it's better to have explicit enable and disable values for 
each parsing option.  Thus "-P" is followed by one or more letters from the 
following:

'c' (or 'C') to disable (or enable) strict comment checking ("<!--" ... 
"-->" vs "<!" ... ">")
'f' (or 'F') to disable (or enable) case folding (upper to lower)
'h' (or 'H') to disable (or enable) header line tagging
't' (or 'T') to disable (or enable) parsing of html tags A, FONT, and IMG

His tests and mine agree that _not_ folding case is good and that tagging 
header lines is good.  My tests show that parsing the innards of A, FONT, 
and IMG tags is _very_ good.  His tests indicate that parsing those innards 
is moot or slightly bad.

Given these results, the defaults are going to be "-PfHt" for now.  If 
additional testing shows that 'T' is more often good than bad, this is 
likely to change.  (Hopefully I've written what I intended.  Today I've 
been confusing a lot of my 0's and 1's, yes's and no's, ...)
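
To make the lower-case/upper-case convention concrete, here is a rough 
sketch in Python.  It is only an illustration of the scheme described 
above, not bogofilter's actual code (which is C); the function name and 
the option names in the table are made up for the example.

```python
# Hypothetical sketch: these names are illustrative, not bogofilter's own.
OPTION_NAMES = {
    "c": "strict_comments",    # "<!--" ... "-->" vs "<!" ... ">"
    "f": "fold_case",          # upper to lower
    "h": "tag_header_lines",
    "t": "parse_html_tags",    # A, FONT, and IMG
}


def apply_parsing_options(letters, state=None):
    """Return a copy of `state` updated by the letters that follow -P."""
    result = dict(state or {})
    for letter in letters:
        name = OPTION_NAMES.get(letter.lower())
        if name is None:
            raise ValueError(f"unknown parsing option: {letter!r}")
        # Upper case enables the capability, lower case disables it.
        result[name] = letter.isupper()
    return result


# The proposed defaults, "-PfHt": no case folding, header tagging on,
# no parsing of html tag innards.
defaults = apply_parsing_options("fHt")

# A later "-PT" overrides only the html tag setting.
overridden = apply_parsing_options("T", defaults)
```

Because each letter sets a value instead of flipping one, the result of 
"-PT" is the same whatever the defaults happen to be, which is the point 
of moving away from toggles.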

>Note that NEWS-0.12 on the sourceforge download page doesn't seem to
>have been updated for 0.12.3 and there's a stray debug fprintf() in
>main.c that's printing out the name of every input file in bulk mode!

Eh?  I just downloaded NEWS-0.12 and it has "0.12.3 2003-05-10" as the 
second line.  What are you seeing?

Bulk mode prints out both filename _and_ spamicity message.  That was in 
the initial implementation.  The purpose is to identify the file to which 
the score applies.




