Mixed case token handling

Peter Bishop pgb at adelard.com
Fri May 30 16:41:19 CEST 2003


I notice that on his web site Paul Graham employs a "degeneration" 
strategy when dealing with mixed case tokens.

So if you cannot find "FREE", you look for the close equivalents "Free" and 
"free" and take the version with the highest spam rating.
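To make the lookup concrete, here is a rough sketch in Python. It is not bogofilter code and not Paul Graham's actual rule set: the function names, the dictionary standing in for the wordlist, the scores, and the 0.4 default for unknown tokens are all assumptions made for illustration.

```python
def case_variants(token):
    """Return the less specific case variants of a token, closest first."""
    lower = token.lower()
    capitalized = token.capitalize()
    if token == lower:
        return []                  # already lower case, nothing to fall back to
    if token == capitalized:
        return [lower]             # "Free" -> "free"
    return [capitalized, lower]    # "FREE" -> "Free", "free"


def spam_score(token, db, unknown=0.4):
    """Score a token, degenerating to case variants if it is not in db."""
    if token in db:
        return db[token]           # an exact match always wins
    scores = [db[v] for v in case_variants(token) if v in db]
    # Take the highest spam rated variant; 'unknown' is an assumed default.
    return max(scores) if scores else unknown


# Hypothetical wordlist: "FREE" itself has never been seen.
db = {"Free": 0.90, "free": 0.65}
print(spam_score("FREE", db))
```

Here "FREE" is missing, so both "Free" and "free" are tried and the higher of their two scores is used. A token that is already lower case has no variants and simply gets the default.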

This would help migration from a casefolded database, as the classification 
algorithm would degenerate to the existing lower case method and 
performance would be no worse than before.
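The migration argument can be checked with a small sketch. It assumes the simplest possible fallback (try the exact token, then its lower case form) and uses a made-up wordlist; `lookup`, `old_lookup` and the scores are hypothetical names and values, not anything taken from bogofilter.

```python
def old_lookup(token, db):
    """The existing method: fold every token to lower case before lookup."""
    return db.get(token.lower())


def lookup(token, db):
    """Case-sensitive lookup that falls back to the lower case form."""
    if token in db:
        return db[token]
    return db.get(token.lower())


# Hypothetical database built while case folding was switched on,
# so it contains lower case tokens only.
old_db = {"free": 0.65, "meeting": 0.10}

# Against the old database both methods give the same answers.
for tok in ["FREE", "Free", "free", "Meeting", "unseen"]:
    assert lookup(tok, old_db) == old_lookup(tok, old_db)

# Once new training adds a mixed case entry, the exact match takes over.
new_db = dict(old_db, FREE=0.99)
print(lookup("FREE", new_db), lookup("Free", new_db))
```

Until mixed case tokens have been learned, every lookup ends up at the lower case entry, so classification matches the casefolded behaviour. Case-specific scores only come into play as new entries accumulate.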

Paul Graham's degeneration rules are quite complex, but perhaps it is 
sufficient to just casefold a mixed case token if the mixed case one does not 
exist and look for that instead.

-- 
Peter Bishop 
Adelard and Centre for Software Reliability, City University
Drysdale Building, 10 Northampton Square, London, EC1V 0HB
Tel: +44-20-7490-9467, Fax: +44-20-7490-9451
pgb at adelard.com, http://www.adelard.com/
pgb at csr.city.ac.uk, http://www.city.ac.uk/





More information about the Bogofilter mailing list