Fw: Mixed case token handling
Sam Hills
rcb at bbll.com
Fri May 30 23:20:21 CEST 2003
Dave Lovelace wrote:
>Jef Poskanzer wrote:
>
>
>>>This would help migration from a casefolded database as classification
>>>algorithm would degenerate to the existing lower case method and
>>>performance would be no worse than before.
>>>
>>>
>>I'm not 100% sure I'm following the discussion correctly, but
>>couldn't you also handle the migration issue with a little script
>>that dumps the database, duplicates all-lowercase tokens with
>>capitalized and all-uppercase versions, and makes a new db?
>>---
>>Jef
>>
>> Jef Poskanzer jef at acme.com http://www.acme.com/jef/
>>
>>
>>
>That would not suffice. It would add "Spam" and "SPAM" but not "SPam",
>"sPam", "sPAm", "SPAm", "SpAm", ...
>And I personally don't think adding every variant on every token is what
>anyone would want.
>
>
>
Especially since that would bloat the db from 1 token per word to as many
as 2^n tokens per word (where n is the number of letters in the word).
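
To make the arithmetic concrete, here is a minimal Python sketch. It is not
bogofilter code: the function names case_variants and simple_variants are made
up for illustration, and it assumes plain ASCII words. It counts every case
spelling of a word and compares that with the three spellings (lowercase,
capitalized, uppercase) that a dump-and-duplicate script would store.

```python
from itertools import product


def case_variants(word):
    """Yield every upper/lower-case spelling of word (hypothetical helper)."""
    # Each letter contributes two choices; digits and punctuation contribute one.
    choices = [
        (ch.lower(), ch.upper()) if ch.lower() != ch.upper() else (ch,)
        for ch in word
    ]
    for combo in product(*choices):
        yield "".join(combo)


def simple_variants(word):
    """The three spellings a dump-and-duplicate script would add."""
    return {word.lower(), word.capitalize(), word.upper()}


all_forms = set(case_variants("spam"))
print(len(all_forms))                               # 2**4 = 16
print(sorted(all_forms - simple_variants("spam")))  # the 13 spellings left out
```

For a 4-letter word the gap is 16 versus 3; for a 10-letter word it would be
1024 versus 3, which is why neither approach looks attractive.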