Re casefolding

David Relson relson at osagesoftware.com
Wed May 14 14:09:29 CEST 2003


At 03:57 AM 5/14/03, michael at optusnet.com.au wrote:

>"Peter Bishop" <pgb at adelard.com> writes:
>[..]
> > False negative performance
> >
> > test  train   spams   fn      fn(with-caps)
> > 2.gz  3.gz    3876    19      14
> > 3.gz  2.gz    1907    9       5
> >
> > I am a bit suspicious about the first result as the
> > count of spams (as split up by formail) changed
> > from the unkludged version (increased by 6 to 3822)..
> >
> > The changes look like they are just about significant;
> > one might expect 19+-4 and 9+-3 from chance variation alone.
>
>Peter, I suspect it's difficult to see how much your
>patch is actually affecting things, owing to the small
>size of your database.
>
>Could you possibly post the patch for this? I'll run it
>against a more sizable corpus I have.
>
>Michael.
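
The chance-variation figures quoted above (19+-4 and 9+-3) are consistent with
treating each false-negative count as a Poisson count, whose standard deviation
is the square root of the count. That model is an assumption on my part, not
something stated in the thread; a minimal sketch of the arithmetic:

```python
import math

def poisson_sd(count):
    """Standard deviation of a Poisson-distributed count (assumed model)."""
    return math.sqrt(count)

# False-negative counts from the quoted table (without caps).
for fn in (19, 9):
    print(f"{fn} +- {poisson_sd(fn):.1f}")
```

With counts this small the error bars are a large fraction of the counts
themselves, which is why a larger corpus would give a clearer answer.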

Michael,

Since 0.12.3, a group of parsing options has been added to bogofilter and 
bogolexer.  They're all toggles that enable/disable capabilities, namely:

"-Pf" for case-folding
"-Ph" for tagging of header lines
"-Pt" for tokenizing of html tags
"-PC" for strict checking (of html comments)

If you update your source code, you'll be able to test case folding to your 
heart's content!
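
To illustrate what case folding changes, here is a rough sketch in Python. It
is not bogofilter's lexer: the token pattern and the function name are
hypothetical and chosen only for illustration. It shows how folding merges
"FREE", "Free" and "free" into one token, while leaving case alone keeps them
as three separate entries in the wordlist.

```python
import re
from collections import Counter

# Hypothetical token pattern; bogofilter's real lexer rules differ.
TOKEN_RE = re.compile(r"[A-Za-z0-9$'!-]+")

def tokenize(text, fold_case=True):
    """Split text into tokens, optionally lower-casing each one."""
    tokens = TOKEN_RE.findall(text)
    if fold_case:
        tokens = [t.lower() for t in tokens]
    return tokens

sample = "FREE offer: get your Free gift, free of charge"

folded = Counter(tokenize(sample, fold_case=True))
unfolded = Counter(tokenize(sample, fold_case=False))

print("folded:  ", folded["free"], "occurrence(s) of 'free'")
print("unfolded:", [unfolded[w] for w in ("FREE", "Free", "free")])
```

Without folding, the capitalized variants can carry their own spam
probabilities, which is the effect the false-negative comparison above is
trying to measure.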

I also have a parsing patch along the lines suggested by Paul Graham in 
"Better Bayesian Filtering" (http://www.paulgraham.com/better.html).  Greg 
is testing the patch to measure its effectiveness.

David






