Fw: Mixed case token handling
David Relson
relson at osagesoftware.com
Fri May 30 22:29:56 CEST 2003
At 03:45 PM 5/30/03, Dave Lovelace wrote:
>Jef Poskanzer wrote:
> >
> > >This would help migration from a casefolded database as classification
> > >algorithm would degenerate to the existing lower case method and
> > >performance would be no worse than before.
> >
> > I'm not 100% sure I'm following the discussion correctly, but
> > couldn't you also handle the migration issue with a little script
> > that dumps the database, duplicates all-lowercase tokens with
> > capitalized and all-uppercase versions, and makes a new db?
> > ---
> > Jef
> >
> > Jef Poskanzer jef at acme.com http://www.acme.com/jef/
> >
>That would not suffice. It would add "Spam" and "SPAM" but not "SPam",
>"sPam", "sPAm", "SPAm", "SpAm", ...
>And I personally don't think adding every variant on every token is what
>anyone would want.
A script to extend "word" to "WORD" and "Word" would triple the wordlist
size, and it's not obvious whether the new tokens would ever be used.
Then, as Dave points out, there are the other 2^n forms of an n-letter
word. I don't think I want to travel that route.
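To put a number on that 2^n blow-up, here is a small sketch (the function name is mine, not anything in bogofilter) that enumerates every mixed-case form of a token:

```python
from itertools import product

def case_variants(word):
    """All mixed-case forms of a word: 2**n variants for n letters."""
    return {"".join(chars)
            for chars in product(*((c.lower(), c.upper()) for c in word))}

# A 4-letter token already has 16 forms, including "SPam" and "sPAm":
variants = case_variants("spam")
print(len(variants))  # 16
```

So duplicating only "Word" and "WORD" covers just 3 of the 16 forms of a 4-letter word, which is why the script approach falls short.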
On a more practical note, suppose an incoming message has "WORD", "Word",
and "word" and the wordlists have "Word" and "word". There will be a cost
in CPU cycles to do the various lookups and pick one. Assuming the
algorithm results in "word" being used for "WORD", I think it will be
necessary to trim the message's token list _after_ the lookups to remove
duplicates.
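One way to picture that cost and the post-lookup trimming step (the fallback order here — exact form, then lowercase, then capitalized — is just an assumption for illustration, not bogofilter's actual rule):

```python
def lookup(wordlist, token):
    """Resolve a token to a wordlist entry, trying up to three forms.

    Hypothetical fallback order: exact match, then all-lowercase,
    then Capitalized. Each miss costs an extra database probe.
    """
    for form in (token, token.lower(), token.capitalize()):
        if form in wordlist:
            return form
    return None

wordlist = {"Word": 3, "word": 7}
message = ["WORD", "Word", "word"]

# "WORD" falls back to "word", so two message tokens resolve to the
# same entry -- the duplicate must be trimmed *after* the lookups:
resolved = [lookup(wordlist, t) for t in message]
unique = sorted(set(r for r in resolved if r is not None))
print(resolved)  # ['word', 'Word', 'word']
print(unique)    # ['Word', 'word']
```

Note the dedup cannot happen before the lookups, because you don't know that "WORD" and "word" collide until the fallback has run.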
Anybody want to tackle degeneration as Paul Graham describes it in "Better
Bayesian Filtering" http://www.paulgraham.com/better.html ?
David