Fw: Mixed case token handling

Dave Lovelace dave at firstcomp.biz
Fri May 30 21:45:45 CEST 2003


Jef Poskanzer wrote:
> 
> >This would help migration from a casefolded database as classification 
> >algorithm would degenerate to the existing lower case method and
> >performance would be no worse than before. 
> 
> I'm not 100% sure I'm following the discussion correctly, but
> couldn't you also handle the migration issue with a little script
> that dumps the database, duplicates all-lowercase tokens with
> capitalized and all-uppercase versions, and makes a new db?
> ---
> Jef
> 
>          Jef Poskanzer  jef at acme.com  http://www.acme.com/jef/
> 
That would not suffice.  It would add "Spam" and "SPAM" but not "SPam",
"sPam", "sPAm", "SPAm", "SpAm", ...
And I personally don't think adding every case variant of every token is
what anyone would want.
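
To make the gap concrete, here is a rough Python sketch (not bogofilter
code; the function names are made up for illustration) that compares the
three forms such a dump-and-duplicate script would write against every
possible upper/lower combination of the same token:

```python
from itertools import product


def script_variants(token):
    """Forms the proposed migration script would write for one token
    (hypothetical helper): lowercase, capitalized, all-uppercase."""
    return {token.lower(), token.capitalize(), token.upper()}


def all_case_variants(token):
    """Every upper/lower combination of the token's letters
    (hypothetical helper). Non-letters are left unchanged."""
    choices = [(c.lower(), c.upper()) if c.isalpha() else (c,) for c in token]
    return {"".join(combo) for combo in product(*choices)}


covered = script_variants("spam")
everything = all_case_variants("spam")
missed = everything - covered

print(sorted(covered))
print(len(everything), "possible variants,", len(missed), "not covered")
```

A token of n letters has 2^n case variants, so even the four-letter
"spam" has 16, of which the script would write only 3.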

-- 
- Dave Lovelace
  dave at firstcomp.biz
  davel at cyberspace.org



