case sensitivity [was: 16.2 not as effective]

Tue Jan 20 23:45:06 CET 2004

On Tue, 20 Jan 2004 17:30:30 -0000
"Peter Bishop" <pgb at adelard.com> wrote:

> On 20 Jan 2004 at 16:53, Geoff wrote:
> 
> > One reason for my "Ignore
> > Case" post a couple of days ago was the suspicion that the
> > loss of this option (which I have always used), was the
> > problem - but I don't know whether it would affect the
> > position so radically because I am unsure if it will
> > immediately impact upon my existing wordlist.db?
> 
> It does make a big difference.
> I tried moving to case sensitive mode,
> but the wordlist database was still case insensitive
> (as I could not rebuild it from scratch).
> 
> The performance went down a lot after I switched, as the mixed case
> tokens no longer matched the case insensitive tokens in the database.
> 
> Performance should in principle recover once you have enough
> mixed case tokens in the database, but I gave up trying
> after a few weeks and went back to case-insensitive mode.
> 
> I now have wordlists with redundant mixed case tokens,
> but no matter, this mode works OK for me.

Peter,

Right on!  There's a definite accuracy change when switching from case
insensitive to case sensitive mode.  At first, accuracy will suffer
because the "new" case sensitive words aren't in the database.  After a
while, accuracy will improve (over case insensitive scores) because
there's more information in the database and it's more specific as to
whether the token is hammish or spammish.
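
The initial drop can be pictured with a toy sketch.  This is only an
illustration in Python, not bogofilter's code or its wordlist.db format;
the function name, the tokens, and the (spam, ham) counts are all made
up.  It assumes a wordlist that was trained in case insensitive mode and
so holds only lowercased tokens:

```python
# Toy illustration only: not bogofilter's code or its wordlist.db format.
def lookup_rate(tokens, wordlist, case_sensitive):
    """Return the fraction of message tokens found in the wordlist."""
    if not tokens:
        return 0.0
    # In case insensitive mode every token is folded to lower case first.
    keys = tokens if case_sensitive else [t.lower() for t in tokens]
    hits = sum(1 for k in keys if k in wordlist)
    return hits / len(keys)

# Hypothetical wordlist built in case insensitive mode: lowercase keys only.
# Values are made-up (spam_count, ham_count) pairs.
wordlist = {
    "free": (40, 2),
    "viagra": (55, 0),
    "meeting": (1, 30),
    "report": (3, 25),
}

message = ["FREE", "Viagra", "meeting", "Report"]

insensitive = lookup_rate(message, wordlist, case_sensitive=False)
sensitive = lookup_rate(message, wordlist, case_sensitive=True)
print(insensitive, sensitive)
```

With these made-up values every token is found after folding, but only
"meeting" is found when case is preserved.  The other three look like
never-seen tokens until enough mixed case mail has been trained.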

Because bogofilter's defaults didn't change from 0.15.13 to 0.16.0, I
was looking for other reasons for a change in accuracy.  FWIW, case
sensitivity has been the default behavior since last May when version
0.13.0 was released.

David