Re casefolding
David Relson
relson at osagesoftware.com
Wed May 14 14:09:29 CEST 2003
At 03:57 AM 5/14/03, michael at optusnet.com.au wrote:
>"Peter Bishop" <pgb at adelard.com> writes:
>[..]
> > False negative performance
> >
> > test    train   spams   fn   fn(with-caps)
> > 2.gz    3.gz    3876    19   14
> > 3.gz    2.gz    1907     9    5
> >
> > I am a bit suspicious about the first result as the
> > count of spams (as split up by formail) changed
> > from the unkludged version (increased by 6 to 3822).
> >
> > The changes look like they are just about significant;
> > one might expect a variation of 19+-4 and 9+-3 from chance alone.
>
>Peter, I suspect it's difficult to see how much your
>patch is actually affecting things owing to the small
>size of your database.
>
>Could you possibly post the patch for this? I'll run it
>against a more sizable corpus I have.
>
>Michael.
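As a rough check on the chance-variation figures quoted above: they are what you would get by treating each false-negative count as a Poisson count, where one standard deviation is the square root of the count. That Poisson treatment is my assumption, not something stated in the quoted message, and the function name below is only illustrative. A minimal sketch:

```python
import math

def poisson_sigma(count):
    """One standard deviation for a count, ASSUMING Poisson statistics."""
    return math.sqrt(count)

# (fn, fn with caps) pairs taken from the two rows of the quoted table
rows = {"row 1": (19, 14), "row 2": (9, 5)}

for name, (fn, fn_caps) in rows.items():
    sigma = poisson_sigma(fn)
    change = fn - fn_caps
    # Express the observed change in units of the expected chance variation
    print(f"{name}: {fn} +- {sigma:.1f}, change of {change} = {change / sigma:.2f} sigma")
```

Under that assumption both changes come out a little above one sigma, which fits "just about significant" at best and supports rerunning on a larger corpus.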
Michael,
Since 0.12.3, a group of parsing options has been added to bogofilter and
bogolexer. They're all toggles that enable or disable capabilities:
"-Pf" for case-folding
"-Ph" for tagging of header lines
"-Pt" for tokenizing of html tags
"-PC" for strict checking (of html comments)
If you update your source code, you'll be able to test case folding to your
heart's content!
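For anyone unsure what the case-folding toggle changes, here is a small sketch of the idea. It is not bogofilter's lexer; the regular expression and the tokenize() helper are hypothetical and only show how folding merges tokens that differ in capitalization alone.

```python
import re

# Hypothetical, simplified token pattern for illustration only;
# bogofilter's real lexer is considerably more involved.
TOKEN_RE = re.compile(r"[A-Za-z0-9']+")

def tokenize(text, fold_case=True):
    """Split text into tokens, optionally lowercasing each one."""
    tokens = TOKEN_RE.findall(text)
    return [t.lower() for t in tokens] if fold_case else tokens

sample = "FREE offer: get your free Money now"

# With folding, "FREE" and "free" become the same token.
print(sorted(set(tokenize(sample, fold_case=True))))
# Without folding, they are kept as distinct tokens.
print(sorted(set(tokenize(sample, fold_case=False))))
```

With folding off, "FREE" and "free" get separate counts in the word list, so capitalization can carry its own spam evidence, at the cost of spreading the same training data over more tokens.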
I also have a parsing patch along the lines suggested by Paul Graham in
"Better Bayesian Filtering" (http://www.paulgraham.com/better.html). Greg
is testing the patch to measure its effectiveness.
David