token degeneration

David Relson relson at osagesoftware.com
Sat May 31 03:34:47 CEST 2003


At 07:15 PM 5/30/03, Greg Louis wrote:
>On 20030530 (Fri) at 1811:20 -0400, David Relson wrote:
>
> > 1 - Is token degeneration a tool to turn on and leave on?  Or is it a
> > technique to transition from ignore case to mixed case?  I've been
> > thinking of it as a transitional tool, but I suspect it's meant as turn
> > on and leave on.
>
>One needs to be careful to turn it _off_ for training unless the anycase
>degeneration is implemented, and if it is, to train both the token (case
>sensitive) and the token (any case, aka folded).  See below.

Training can be complicated when Anycase is supported.  Ideally,
Anycase*token exists only if there are two variants of the token in the
wordlist.  Given a token with which to train, the wordlist may have an
exact match, an inexact match, and/or Anycase*token.  Update all that
apply.  It may be that there's only an inexact match, in which case both
an exact match and Anycase*token get created.  Question: can BerkeleyDB
do a case-insensitive search?
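
Leaving the BerkeleyDB question aside for a moment, here's a rough sketch
of the training update just described.  The wordlist_* and fold_case names
are invented for illustration; they are not bogofilter's actual functions:

    /* Sketch only: the wordlist_* and fold_case helpers are hypothetical
     * names for illustration, not bogofilter's real API. */
    #include <stdbool.h>
    #include <stdio.h>

    #define MAX_KEY 256

    extern bool wordlist_exists(const char *key);            /* hypothetical */
    extern void wordlist_update(const char *key, bool spam); /* bump counts  */
    extern void wordlist_create(const char *key, bool spam); /* new record   */
    extern void fold_case(char *dst, const char *src);       /* lowercase copy */
    /* true if some *other* case variant of the token is already stored: */
    extern bool wordlist_has_variant(const char *folded, const char *exact);

    void train_token(const char *token, bool is_spam)
    {
        char folded[MAX_KEY], anycase[MAX_KEY];

        fold_case(folded, token);
        snprintf(anycase, sizeof(anycase), "Anycase*%s", folded);

        if (wordlist_exists(token))
            wordlist_update(token, is_spam);   /* exact match: update it */
        else
            wordlist_create(token, is_spam);   /* first sighting of this spelling */

        if (wordlist_exists(anycase))
            wordlist_update(anycase, is_spam); /* cumulative any-case record */
        else if (wordlist_has_variant(folded, token)) {
            /* Only an inexact match existed; now a second spelling exists,
             * so create Anycase*token to tie the variants together. */
            wordlist_create(anycase, is_spam);
        }
    }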

> > 2 - Two matching techniques are described - (a) search for a variety of
> > matches and use the one farthest from 0.5; and (b) create and maintain
> > an additional "Anywhere*foo" token with cumulative ham/spam statistics
> > that is updated when any form of "foo" is encountered.  Which is
> > preferred?
>
>With (a) one would search for all matches, case-insensitively, and
>compare the individual scores.  That would be much more efficient than
>permuting the token and doing individual searches, though it might
>require a biggish change in internal representation in the wordlists.
>As for "which is preferred?", that's the wrong question: the question
>is, which (if either) contributes enough to be worth its overhead?
>(Admittedly, people with different email volumes may have differing
>views about "worth".)
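
For what it's worth, option (a) boils down to something like the following,
with variant_scores() as an invented helper that would return the scores of
the stored case variants:

    /* Sketch of option (a): among case-insensitive matches, keep the score
     * farthest from 0.5.  variant_scores() is hypothetical. */
    #include <math.h>
    #include <stddef.h>

    extern size_t variant_scores(const char *folded, double *out, size_t max);

    double pick_extreme_score(const char *folded)
    {
        double scores[16];
        size_t n = variant_scores(folded, scores, 16);
        double best = 0.5;                 /* neutral if nothing matched */
        size_t i;

        for (i = 0; i < n; i++)
            if (fabs(scores[i] - 0.5) > fabs(best - 0.5))
                best = scores[i];
        return best;
    }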

Given Paul's degeneration rules, the token "fReE" could exist and wouldn't
match any of "FREE", "Free", or "free".  Alternatively, given "fReE" to
match, the algorithm would try the all-lowercase form and match it with
"free".
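
In code, that lookup order amounts to roughly this (find_token is an
invented stand-in for the wordlist lookup, not the real function name):

    /* Sketch of the lookup order above: exact spelling first, then the
     * all-lowercase degenerate form.  find_token() is hypothetical. */
    #include <ctype.h>
    #include <stddef.h>

    extern const void *find_token(const char *key);   /* hypothetical lookup */

    const void *degenerate_lookup(const char *token)
    {
        char lower[256];
        size_t i;
        const void *rec = find_token(token);           /* "fReE" as stored */

        if (rec != NULL)
            return rec;

        for (i = 0; token[i] != '\0' && i < sizeof(lower) - 1; i++)
            lower[i] = (char)tolower((unsigned char)token[i]);
        lower[i] = '\0';

        return find_token(lower);                      /* degenerates to "free" */
    }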

> > Degenerating from uppercase to any-case is not feasible as there's no
> > easy way to query BerkeleyDB to find that "FREE" is matched by "FrEe"
> > and "fREE".
>
>Store the token both as is and case-folded, with links?

Yikes!  Links scare me.  They're hard to maintain and they break.
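
One link-free possibility (untested, and not something bogofilter does
today) would be to give the BerkeleyDB btree a comparison function that
sorts case-insensitively first and only then by exact bytes.  "FREE",
"FrEe", and "free" would stay distinct keys but sort next to each other,
so a DB_SET_RANGE cursor started at the all-uppercase spelling could walk
every case variant in one pass:

    /* Sketch, not current bogofilter code: a two-level btree comparator.
     * Keys that fold to the same string sort together; exact bytes break
     * the tie, so distinct case variants remain distinct keys. */
    #include <ctype.h>
    #include <string.h>
    #include <db.h>

    static int fold_then_exact(DB *dbp, const DBT *a, const DBT *b)
    {
        const unsigned char *pa = a->data, *pb = b->data;
        size_t n = a->size < b->size ? a->size : b->size;
        size_t i;

        (void)dbp;
        for (i = 0; i < n; i++) {            /* case-insensitive pass */
            int ca = tolower(pa[i]), cb = tolower(pb[i]);
            if (ca != cb)
                return ca - cb;
        }
        if (a->size != b->size)
            return a->size < b->size ? -1 : 1;
        return memcmp(pa, pb, n);            /* tie-break on exact bytes */
    }

    /* installed before DB->open with:
     *     dbp->set_bt_compare(dbp, fold_then_exact);
     * variants are then reachable with a DB_SET_RANGE cursor. */

Of course, changing the sort order of an existing wordlist means rebuilding
it, which goes straight back to Greg's overhead question.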

> > "Anywhere*foo" is an additional token that is updated when any form of
> > "foo" is encountered.  It seems that this would save the multiple lookups
> > (described above)
>
>Not if I read Paul correctly; it would be a last-resort thing.  He
>wants to "keep track of statistics for ``foo'' overall as well as
>specific versions, and degenerate from ``Subject*foo'' not to ``foo'' but
>to ``Anywhere*foo''."  You still have to go through the specific
>versions.

I interpret Anywhere*foo as the easy fallback when there's no exact
match.  It will be retrieved whenever a new case-sensitive variant is
encountered.  For training, the new case-sensitive variant will create a
new token (and will also update Anywhere*foo).
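
A sketch of that interpretation at scoring time, reusing the invented
find_token and fold_case names from the earlier sketches (whether the
Anywhere record should be keyed on the folded spelling is itself one of
the open details):

    /* Sketch of the fallback interpretation above.  find_token() and
     * fold_case() are hypothetical helpers, not bogofilter's real API. */
    #include <stdio.h>

    extern const void *find_token(const char *key);     /* hypothetical */
    extern void fold_case(char *dst, const char *src);  /* hypothetical */

    const void *score_lookup(const char *token)
    {
        char folded[256], anywhere[256 + 16];
        const void *rec = find_token(token);   /* exact, case-sensitive match */

        if (rec != NULL)
            return rec;

        /* New case-sensitive variant, no exact record yet: fall back to the
         * cumulative Anywhere record.  Training would then create the exact
         * record and update Anywhere*token as well. */
        fold_case(folded, token);
        snprintf(anywhere, sizeof(anywhere), "Anywhere*%s", folded);
        return find_token(anywhere);
    }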

> > and would significantly increase wordlist size.
>
>I think the specific versions would have a greater impact, but I also
>think that the impact might prove tolerable.  That, however, needs to
>be a matter for experimentation, not speculation.

So, the conclusion seems to be "implement the bells and whistles, test, 
throw away what's not useful."

Before that happens, I want to be clear on what we're going to test.





