case folding [was: tuning ]

Peter Bishop pgb at adelard.com
Thu May 8 19:56:28 CEST 2003


Interesting.
So it would appear that:
1) removing case folding is not expensive in terms of database size
2) accuracy is improved (though we can debate by how much)
This seems worth some more in-depth experiments on performance,
e.g. how about using some more (and different) files from spamarchive.org
to test spam detection performance?
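
To make point 1 concrete, here is a rough sketch of how one might compare the number of distinct tokens with and without case folding. The tokenizer regex, the function name and the sample messages are made up for illustration; this is not bogofilter's lexer, and real counts would have to come from real corpora.

```python
import re

# Hypothetical tokenizer for illustration only; not bogofilter's lexer.
TOKEN_RE = re.compile(r"[A-Za-z0-9']+")

def distinct_tokens(messages, fold_case):
    """Collect the distinct tokens seen across all messages."""
    tokens = set()
    for text in messages:
        for tok in TOKEN_RE.findall(text):
            # With folding, "FREE", "Free" and "free" collapse into one token.
            tokens.add(tok.lower() if fold_case else tok)
    return tokens

# Made-up sample messages, not taken from any real corpus.
sample = [
    "FREE money! Click here for free MONEY",
    "Free lunch at the meeting, click to confirm",
]

folded = distinct_tokens(sample, fold_case=True)
unfolded = distinct_tokens(sample, fold_case=False)
print(len(folded), len(unfolded))
```

Without folding the token set can only stay the same size or grow, since each folded token corresponds to at least one unfolded spelling. How much it grows depends on how many case variants actually occur in the mail.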

On 8 May 2003 at 18:13, Joerg Over wrote:

> So, database size and token count don't increase a lot; indeed a
> lot less than I expected.
> 
> Now for the accuracy. This is the hard part. I tested the big
> databases with and without case mangling against my little spam
> collection which was -not- part of the generated spam databases.
> Results with -g were identical.
> Results with -r show an increase in spamicity of between 5% and
> 20%.
> There's 1 false negative (fn) with mangling, 0 fn without.
> Results with -f show - well, I'm at war with fisher-graham.
> With case mangling I get 30 fn in my 33 spam-mails.
> Without case mangling I get 2 fn in my 33 spams.
> 
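
For reference, the false-negative counts quoted above can be turned into rates. This is only a small arithmetic sketch; it assumes the -r counts refer to the same 33-message spam collection as the -f counts, which the quoted text does not state explicitly.

```python
# False negatives quoted above, out of 33 spam mails.
# Assumption: the -r results use the same 33-message collection as -f.
TOTAL_SPAM = 33
false_negatives = {
    ("-r", "with case mangling"): 1,
    ("-r", "without case mangling"): 0,
    ("-f", "with case mangling"): 30,
    ("-f", "without case mangling"): 2,
}

def fn_rate(fn, total):
    """Fraction of spam messages that were missed."""
    return fn / total

for (flag, variant), fn in false_negatives.items():
    rate = fn_rate(fn, TOTAL_SPAM)
    print(f"{flag} {variant}: {fn}/{TOTAL_SPAM} = {rate:.1%}")
```

With only 33 messages, a single message is worth about 3 percentage points, so small differences (such as 1 fn versus 0 fn) say little on their own. A larger test set would be needed to judge them.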

-- 
Peter Bishop 
Adelard and Centre for Software Reliability, City University
Drysdale Building, 10 Northampton Square, London, EC1V 0HB
Tel: +44-20-7490-9467, Fax: +44-20-7490-9451
pgb at adelard.com, http://www.adelard.com/
pgb at csr.city.ac.uk, http://www.city.ac.uk/




