case folding [was: tuning]

michael at optusnet.com.au
Wed May 7 00:52:00 CEST 2003


David Relson <relson at osagesoftware.com> writes:
> At 01:37 PM 5/6/03, Joerg Over wrote:
> >In that line of thought: Why is case mangled in the database?
> >I'd believe that there _would_ be a difference and maybe greater
> >accuracy.
[...] 
> Case folding saves on database size.  Without it, you might have
> "money", "Money", "MONEY", "monEy", ...

Indeed. The other point that's frequently missed is that there's
a tradeoff between accuracy and training duration.

The higher the accuracy you want, the more clues you look at
(e.g. using the case as well as the spelling), and the more data
bogofilter needs before it can make sensible use of those clues. That
in turn means you need a larger email/spam corpus to train on, which
drives a larger database.
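
To make that concrete, here is a rough sketch in Python. It is not
bogofilter's actual tokenizer; the messages, the tokenize() rule and
the count_tokens() helper are all made up for illustration. It counts
the distinct tokens in a toy corpus with and without case folding.

```python
import re
from collections import Counter

# Hypothetical toy messages, not real bogofilter training data.
MESSAGES = [
    "Make MONEY fast",
    "Money back guarantee",
    "send money now",
    "Make monEy at home",
]

def tokenize(text, fold_case):
    """Split on runs of ASCII letters; optionally lower-case each token."""
    words = re.findall(r"[A-Za-z]+", text)
    return [w.lower() for w in words] if fold_case else words

def count_tokens(messages, fold_case):
    """Return a Counter mapping each token to its number of occurrences."""
    counts = Counter()
    for msg in messages:
        counts.update(tokenize(msg, fold_case))
    return counts

folded = count_tokens(MESSAGES, fold_case=True)
unfolded = count_tokens(MESSAGES, fold_case=False)

print(len(unfolded), "distinct tokens without folding")
print(len(folded), "distinct tokens with folding")
print("count for 'money' (folded):", folded["money"])
print("counts per variant (unfolded):",
      {tok: n for tok, n in unfolded.items() if tok.lower() == "money"})
```

With folding, all four sightings of the word land on one entry. Without
it, each spelling gets its own entry with a single sighting, so the
database has more rows and less evidence behind each one.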

Michael.



