Fw: Mixed case token handling

Fri May 30 23:20:21 CEST 2003

Dave Lovelace wrote:

>Jef Poskanzer wrote:
>  
>
>>>This would help migration from a casefolded database, as the classification 
>>>algorithm would degenerate to the existing lowercase method and 
>>>performance would be no worse than before. 
>>>      
>>>
>>I'm not 100% sure I'm following the discussion correctly, but
>>couldn't you also handle the migration issue with a little script
>>that dumps the database, duplicates all-lowercase tokens with
>>capitalized and all-uppercase versions, and makes a new db?
>>---
>>Jef
>>
>>         Jef Poskanzer  jef at acme.com  http://www.acme.com/jef/
>>
>>    
>>
>That would not suffice.  It would add "Spam" and "SPAM" but not "SPam",
>"sPam", "sPAm", "SPAm", "SpAm", ...
>And I personally don't think adding every variant on every token is what
>anyone would want.
>
>  
>
Especially since that would bloat the db from 1 token per word to as many
as 2^n tokens per word (where n is the number of letters in the word).
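
To make the counting concrete, here is a minimal sketch in Python. It is not
bogofilter code; the function names are made up for illustration, and it
assumes plain ASCII tokens. It compares the three forms the proposed script
would add with the full set of mixed-case spellings of one token.

```python
from itertools import product


def script_variants(token):
    """The three forms the proposed migration script would store
    (hypothetical helper): lowercase, capitalized, all-uppercase."""
    return {token.lower(), token.capitalize(), token.upper()}


def all_case_variants(token):
    """Every upper/lower combination of the token's characters.

    Each letter contributes two choices and each non-letter one,
    so a token with n letters yields 2**n spellings (ASCII assumed).
    """
    choices = [{ch.lower(), ch.upper()} for ch in token]
    return {"".join(combo) for combo in product(*choices)}


if __name__ == "__main__":
    word = "spam"
    everything = all_case_variants(word)
    covered = script_variants(word)
    print(len(covered), "forms added by the script")
    print(len(everything), "possible mixed-case forms")
    print(sorted(everything - covered))
```

For a four-letter token the script covers 3 of the 16 possible spellings, and
each additional letter doubles the total, so storing every variant quickly
stops being practical for longer words.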