Fw: Mixed case token handling
Dave Lovelace
dave at firstcomp.biz
Fri May 30 21:45:45 CEST 2003
Jef Poskanzer wrote:
>
> >This would help migration from a casefolded database, as the
> >classification algorithm would degenerate to the existing lower-case
> >method and performance would be no worse than before.
>
> I'm not 100% sure I'm following the discussion correctly, but
> couldn't you also handle the migration issue with a little script
> that dumps the database, duplicates all-lowercase tokens with
> capitalized and all-uppercase versions, and makes a new db?
> ---
> Jef
>
> Jef Poskanzer jef at acme.com http://www.acme.com/jef/
>
That would not suffice. It would add "Spam" and "SPAM" but not "SPam",
"sPam", "sPAm", "SPAm", "SpAm", ...
And I personally don't think adding every variant of every token is what
anyone would want: a token of n letters has 2^n case variants, so the
database would balloon.
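
To put a number on that, here is a rough sketch in Python. It is not
anything bogofilter does; case_variants is a made-up helper, and it assumes
plain ASCII tokens. It enumerates every upper/lower-case spelling of a token
and counts how many the proposed script (lowercase, capitalized,
all-uppercase) would leave out.

```python
from itertools import product


def case_variants(token: str) -> list[str]:
    """Return every distinct upper/lower-case spelling of token.

    Hypothetical helper for illustration only; assumes ASCII input.
    """
    # Each letter contributes two choices; digits and punctuation contribute one.
    choices = [sorted({ch.lower(), ch.upper()}) for ch in token]
    return ["".join(combo) for combo in product(*choices)]


variants = case_variants("spam")
print(len(variants))  # 16, i.e. 2**4

# What the dump-and-duplicate script would store for this token.
script_adds = {"spam", "Spam", "SPAM"}
missed = [v for v in variants if v not in script_adds]
print(len(missed))  # 13 spellings such as "sPam" are still unseen
```

A ten-letter token would already need 1024 entries, which is why I would
rather fall back to the lower-case form at lookup time than store them all.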
--
- Dave Lovelace
dave at firstcomp.biz
davel at cyberspace.org
More information about the Bogofilter mailing list