<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=ISO-8859-1">
<title></title>
</head>
<body>
<br>
<br>
Dave Lovelace wrote:<br>
<blockquote type="cite" cite="mid200305301945.PAA11010@firstcomp.biz">
<pre wrap="">Jef Poskanzer wrote:
</pre>
<blockquote type="cite">
<blockquote type="cite">
<pre wrap="">This would help migration from a casefolded database, as the classification
algorithm would degenerate to the existing lower case method and
performance would be no worse than before.
</pre>
</blockquote>
<pre wrap="">I'm not 100% sure I'm following the discussion correctly, but
couldn't you also handle the migration issue with a little script
that dumps the database, duplicates all-lowercase tokens with
capitalized and all-uppercase versions, and makes a new db?
---
Jef
Jef Poskanzer <a class="moz-txt-link-abbreviated" href="mailto:jef@acme.com">jef@acme.com</a> <a class="moz-txt-link-freetext" href="http://www.acme.com/jef/">http://www.acme.com/jef/</a>
</pre>
</blockquote>
<pre wrap=""><!---->That would not suffice. It would add "Spam" and "SPAM" but not "SPam",
"sPam", "sPAm", "SPAm", "SpAm", ...
And I personally don't think adding every variant on every token is what
anyone would want.
</pre>
</blockquote>
Especially since that would bloat the db from 1 token per word to as many as 2^n
tokens per word (where n is the number of letters in the word).<br>
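<br>
To put rough numbers on that, here is a small Python sketch. It is only an
illustration and not part of any existing tool; the function names
simple_variants and all_case_variants are made up for this example. It
compares the three forms the proposed script would add with the full set of
case variants of a token.<br>
<pre wrap="">
```python
from itertools import product


def simple_variants(token):
    """Lowercase, capitalized, and all-uppercase forms of a token
    (the three forms the proposed migration script would store)."""
    return {token.lower(), token.capitalize(), token.upper()}


def all_case_variants(token):
    """Every upper/lower combination of the token's characters.

    Characters without case (digits, punctuation) contribute a single
    choice, so the count is 2^n only when all n characters are letters.
    """
    choices = [{ch.lower(), ch.upper()} for ch in token]
    return {"".join(combo) for combo in product(*choices)}


variants = all_case_variants("spam")
print(len(variants))  # 16, i.e. 2^4
# Variants the three-form script would miss, e.g. "SPam" and "sPAm":
print(sorted(variants - simple_variants("spam")))
```
</pre>
Even a 4-letter word goes from 1 token to 16, and a 10-letter word would
need 1024, so the growth gets out of hand quickly.<br>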
</body>
</html>