Mixed case token handling

Fri May 30 23:23:53 CEST 2003

>On a more practical note, suppose an incoming message has "WORD", "Word", 
>and "word" and the wordlists have "Word" and "word". There will be a cost 
>in cpu cycles to do the various lookups and pick one. Assuming the 
>algorithm results in "word" being used for "WORD", I think it will be 
>necessary to trim the messages list _after_ lookups to remove duplications. 

Paul has an interesting footnote (7) in his newer article. He points out that instead of degenerating FOO to Foo or foo, he really should degenerate to something like Anycase*foo, which would represent the count for all occurrences of "foo" in any combination of case. 

Suppose bogofilter did this. To migrate my pre-0.13 databases, I could rebuild them, replacing every token with Anycase*token. From that point forward, I'd train using -PI. At most, bogofilter would need to do two lookups per token during classification (although of course training would also take a bit longer). 
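
As a rough illustration of the migration and the two-lookup idea, here is a minimal Python sketch. It is not bogofilter's code. It assumes a wordlist can be modeled as a plain dict mapping a token to a single count (the real wordlists keep spam and non-spam counts in a database), that the old wordlist is already case-folded, and that the second lookup is a fallback used only when the exact-case token is missing. The names migrate, train, and lookup are hypothetical; only the "Anycase*" prefix comes from the proposal.

```python
# Hypothetical model of the Anycase*token proposal; not bogofilter's real code.
ANYCASE = "Anycase*"  # prefix from the proposal; the storage format is assumed


def migrate(old_counts):
    """Rebuild a case-folded wordlist so every token becomes Anycase*token."""
    new_counts = {}
    for token, count in old_counts.items():
        key = ANYCASE + token.lower()
        # Sum counts in case two old tokens differ only by case.
        new_counts[key] = new_counts.get(key, 0) + count
    return new_counts


def train(counts, token):
    """Count the exact-case token and its any-case aggregate (two writes)."""
    counts[token] = counts.get(token, 0) + 1
    key = ANYCASE + token.lower()
    counts[key] = counts.get(key, 0) + 1


def lookup(counts, token):
    """At most two lookups: the exact-case token, then Anycase*token."""
    if token in counts:
        return counts[token]
    return counts.get(ANYCASE + token.lower(), 0)


db = migrate({"word": 5, "free": 2})  # old, case-folded wordlist
train(db, "Word")                     # case-sensitive training after migration
print(lookup(db, "Word"))             # exact-case entry is found first
print(lookup(db, "WORD"))             # falls back to Anycase*word
```

In this model "WORD" has never been seen in that exact form, so it is scored from the any-case count, which includes both the migrated data and the newly trained "Word".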

One thing I like about this is that the same feature would aid both migration and general discrimination (if Paul is right). This makes me think I'd rather wait for the full degeneration code to be added than use a partial, migration-specific feature. In the meantime I'd just activate ignore_case in 0.13.

As for the option's name, how about: 

  -Pd/-PD  case_degeneration 

Or, if the program supports Paul's full proposal: 

  -Pd/-PD  token_degeneration

Shawn