Fw: Mixed case token handling
David Relson
relson at osagesoftware.com
Fri May 30 22:29:56 CEST 2003
At 03:45 PM 5/30/03, Dave Lovelace wrote:
>Jef Poskanzer wrote:
> >
> > >This would help migration from a casefolded database as classification
> > >algorithm would degenerate to the existing lower case method and
> > >performance would be no worse than before.
> >
> > I'm not 100% sure I'm following the discussion correctly, but
> > couldn't you also handle the migration issue with a little script
> > that dumps the database, duplicates all-lowercase tokens with
> > capitalized and all-uppercase versions, and makes a new db?
> > ---
> > Jef
> >
> > Jef Poskanzer jef at acme.com http://www.acme.com/jef/
> >
>That would not suffice. It would add "Spam" and "SPAM" but not "SPam",
>"sPam", "sPAm", "SPAm", "SpAm", ...
>And I personally don't think adding every variant on every token is what
>anyone would want.
A script to extend "word" to "WORD" and "Word" would triple the wordlist
size, and it's not obvious whether the new tokens would ever be used.
Then, as Dave points out, there are the other 2^n forms of an n-letter
word. I don't think I want to travel that route.
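To put a number on that 2^n blow-up, here is a small sketch (the function name is mine, not anything in bogofilter) that enumerates every mixed-case form of a token:

```python
from itertools import product

def case_variants(word):
    """All mixed-case forms of a word: 2**n variants for n letters."""
    return {"".join(chars)
            for chars in product(*((c.lower(), c.upper()) for c in word))}

# A 4-letter token already has 16 forms, including "SPam" and "sPAm":
variants = case_variants("spam")
print(len(variants))  # 16
```

So duplicating only "Word" and "WORD" covers just 3 of the 16 forms of a 4-letter word, which is why the script approach falls short.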
On a more practical note, suppose an incoming message has "WORD", "Word",
and "word" and the wordlists have "Word" and "word". There will be a cost
in CPU cycles to do the various lookups and pick one. Assuming the
algorithm results in "word" being used for "WORD", I think it will be
necessary to trim the message's token list _after_ the lookups to remove
duplicates.
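One way to picture that cost and the post-lookup trimming step (the fallback order here — exact form, then lowercase, then capitalized — is just an assumption for illustration, not bogofilter's actual rule):

```python
def lookup(wordlist, token):
    """Resolve a token to a wordlist entry, trying up to three forms.

    Hypothetical fallback order: exact match, then all-lowercase,
    then Capitalized. Each miss costs an extra database probe.
    """
    for form in (token, token.lower(), token.capitalize()):
        if form in wordlist:
            return form
    return None

wordlist = {"Word": 3, "word": 7}
message = ["WORD", "Word", "word"]

# "WORD" falls back to "word", so two message tokens resolve to the
# same entry -- the duplicate must be trimmed *after* the lookups:
resolved = [lookup(wordlist, t) for t in message]
unique = sorted(set(r for r in resolved if r is not None))
print(resolved)  # ['word', 'Word', 'word']
print(unique)    # ['Word', 'word']
```

Note the dedup cannot happen before the lookups, because you don't know that "WORD" and "word" collide until the fallback has run.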
Anybody want to tackle degeneration as Paul Graham describes it in "Better
Bayesian Filtering" http://www.paulgraham.com/better.html ?
David