Re casefolding

Peter Bishop pgb at adelard.com
Tue May 13 19:56:43 CEST 2003


I thought I would have another look at case folding.
The approach is the same as before, except that I was more careful to
keep the mail headers intact when I kludged the test files, and
I tried to capture all capitalisation of words in the database.

This time every capital letter is followed by a hyphen
e.g. 
FREE  -> F-R-E-E-
Now -> N-ow

so they are stored in the database as f-r-e-e, n-ow.
There is some risk that these will collide with existing
lower-case words, but not much.

Increase in database tokens:

2.gz	42585	->	49211
3.gz	26565	->	31296
ham	12081	->	14683
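
A minimal sketch of the marking scheme described above, assuming plain
ASCII capitals and that lower-casing happens after the hyphens are added.
The function name mark_capitals is made up for illustration and is not
part of bogofilter; the actual kludge applied to the test files may differ.

```python
import re


def mark_capitals(token: str) -> str:
    """Put a hyphen after every ASCII capital, then fold to lower case.

    Hypothetical helper: the capitalisation pattern survives case folding
    because it is encoded in the position of the hyphens.
    """
    marked = re.sub(r"([A-Z])", r"\1-", token)
    return marked.lower()


if __name__ == "__main__":
    for word in ("FREE", "Now", "free"):
        print(word, "->", mark_capitals(word))
```

A token with no capitals passes through unchanged, so the only collisions
are with lower-case tokens that already contain a hyphen in the same place.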

False negative performance

test 	train	spams	fn	fn(with-caps)
2.gz	3.gz	3876	19	14
3.gz	2.gz	1907	 9	 5

I am a bit suspicious about the first result, as the
count of spams (as split up by formail) changed
from the unkludged version (increased by 6 to 3822).

The changes look like they are just about significant:
one might expect 19+-4 and 9+-3 from chance variation alone.
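
The +-4 and +-3 figures are consistent with treating each false-negative
count as a Poisson count, whose standard deviation is the square root of
the count. That reading is an assumption on my part, as are the helper
names below; a rough sketch of the arithmetic:

```python
import math


def poisson_sd(count: int) -> float:
    """Standard deviation of a Poisson count (assumed model): sqrt(count)."""
    return math.sqrt(count)


def drop_in_sds(before: int, after: int) -> float:
    """Size of the improvement, measured in standard deviations of 'before'."""
    return (before - after) / poisson_sd(before)


if __name__ == "__main__":
    # (fn without caps, fn with caps) taken from the table above
    for before, after in ((19, 14), (9, 5)):
        print(before, "+-", round(poisson_sd(before)),
              "drop of", round(drop_in_sds(before, after), 2), "sd")
```

Under this model both improvements come out at a little over one standard
deviation, which fits "just about" rather than clearly significant.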

So maybe capitals have some effect after all!

-- 
Peter Bishop 
pgb at adelard.com
pgb at csr.city.ac.uk





