Re casefolding
Peter Bishop
pgb at adelard.com
Tue May 13 19:56:43 CEST 2003
I thought I would have another look at case folding.
Same approach as before, except that I was more careful to preserve
the mail headers intact when I kludged the test files,
and I tried to capture all capitalisation of words in the database.
This time every capital letter is followed by a hyphen
e.g.
FREE -> F-R-E-E-
Now -> N-ow
so they are stored in the database as f-r-e-e, n-ow
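A minimal sketch of the marking step described above, assuming it runs on each token before the usual lower-casing. The function name mark_capitals is hypothetical, not part of bogofilter or of the kludge actually used. The sketch applies the stated rule literally (every capital is followed by a hyphen), so FREE comes out with a trailing hyphen as in the F-R-E-E- example.

```python
def mark_capitals(token: str) -> str:
    """Lower-case a token, adding a hyphen after each letter that was a capital.

    Hypothetical helper: illustrates the rule "every capital letter is
    followed by a hyphen", then folds case as the database normally would.
    """
    out = []
    for ch in token:
        if ch.isupper():
            out.append(ch.lower() + "-")  # remember that this letter was a capital
        else:
            out.append(ch)                # lower case, digits, punctuation unchanged
    return "".join(out)


if __name__ == "__main__":
    for word in ["FREE", "Now", "free", "now"]:
        print(word, "->", mark_capitals(word))
```

A collision only happens when a message already contains the marked form as a literal lower-case token (for example a genuine "n-ow"), which is why the risk looks small.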
There is some risk that this will be the same as existing
lower case words - but not much:
Increase in database tokens:
  2.gz   42585 -> 49211
  3.gz   26565 -> 31296
  ham    12081 -> 14683
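For a sense of scale, the growth can be expressed as a percentage. This is only arithmetic on the three pairs of counts quoted above; the helper name percent_increase is made up for the example.

```python
def percent_increase(before: int, after: int) -> float:
    """Relative growth of the token count, in percent."""
    return 100.0 * (after - before) / before


# Token counts as quoted in the post: (without caps marking, with caps marking)
token_counts = {
    "2.gz": (42585, 49211),
    "3.gz": (26565, 31296),
    "ham": (12081, 14683),
}

if __name__ == "__main__":
    for name, (before, after) in token_counts.items():
        print(f"{name}: {percent_increase(before, after):.1f}% more tokens")
```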
False negative performance:
  test   train   spams   fn   fn(with-caps)
  2.gz   3.gz    3876    19   14
  3.gz   2.gz    1907     9    5
I am a bit suspicious about the first result, as the
count of spams (as split up by formail) changed
from the unkludged version (increased by 6 to 3822).
The changes look like they are just about significant:
one might expect a variation of 19 +- 4 and 9 +- 3 from chance alone.
So maybe capitals have some effect after all!
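The +- figures above are consistent with treating each false-negative count as a Poisson count, whose standard deviation is roughly the square root of the count. That reading is an assumption, not something stated in the post. A rough sketch of the check, with the hypothetical names poisson_sd and drop_in_sd:

```python
import math


def poisson_sd(count: int) -> float:
    """Approximate chance variation of a count, assuming Poisson statistics."""
    return math.sqrt(count)


def drop_in_sd(baseline: int, with_caps: int) -> float:
    """How many baseline standard deviations the fn count dropped by."""
    return (baseline - with_caps) / poisson_sd(baseline)


# (fn without caps marking, fn with caps marking), from the table above
results = {"test 2.gz": (19, 14), "test 3.gz": (9, 5)}

if __name__ == "__main__":
    for name, (baseline, with_caps) in results.items():
        print(
            f"{name}: {baseline} +- {poisson_sd(baseline):.1f}, "
            f"drop of {drop_in_sd(baseline, with_caps):.2f} sd"
        )
```

On these numbers each drop is a little over one standard deviation of the baseline count.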
--
Peter Bishop
pgb at adelard.com
pgb at csr.city.ac.uk