token degeneration

David Relson relson at osagesoftware.com
Sat May 31 00:11:20 CEST 2003


Hi,

Paul Graham's latest article "Better Bayesian Filtering", 
http://www.paulgraham.com/better.html, is the basis of the recent parsing 
changes in bogofilter.  Now that we have those capabilities, the question of 
migrating from case-insensitive wordlists to case-sensitive wordlists has 
come up.  He describes a technique called "token degeneration" for finding 
a match to a token when an exact match isn't found.  I have read (and 
reread) his comments on token degeneration, a.k.a. how to find a known 
token that is similar to an unknown token.  Below are some notes on what I 
think the process requires (and its repercussions).

Before I give the notes, there are some questions to raise:

1 - Is token degeneration a tool to turn on and leave on?  Or is it a 
technique for transitioning from case-insensitive to mixed-case wordlists?  
I've been thinking of it as a transitional tool, but I suspect it's meant 
to be turned on and left on.

2 - Two matching techniques are described: (a) search for a variety of 
matches and use the one farthest from 0.5; and (b) create and maintain an 
additional "Anywhere*foo" token with cumulative ham/spam statistics that is 
updated whenever any form of "foo" is encountered.  Which is preferred?

Discussion of the questions above and the interpretations below is 
encouraged.  I'm not 100% certain that my interpretation is true to the 
article.


Now for the info from the article ...

First, a quote from the article:

Here are the alternatives [7] considered if the filter sees ``FREE'' in the 
Subject line and doesn't have a probability for it.

Subject*FREE!!! Subject*Free!!! Subject*free!!!
Subject*FREE! Subject*Free! Subject*free!
Subject*FREE Subject*Free Subject*free
FREE!!! Free!!! free!!!
FREE! Free! free!
FREE Free free

As can be seen, a nonmatching token (with capitals and exclamation points) 
will cause up to 17 additional lookups if it comes from a header line, or 8 
additional lookups if it does not: that is, the 6 x 3 grid above, less the 
exact-match form.  If the token isn't from a header line, or doesn't have 
any capital letters, or doesn't have exclamation points, there will be 
fewer lookups.

As I interpret the process, the 18 lookups are:

1 - Exact match
2 - All upper case reduced to 1 leading cap
3 - 1 leading cap (or mixed case) reduced to all lower case
4,5,6 - Multiple trailing exclamation points reduced to 1, then forms 1, 2, 3
7,8,9 - Trailing exclamation point removed, then forms 1, 2, 3
10-18 - Leading context (e.g. "Subject*") removed, then forms 1 ... 9

Given more than one match, use the value farthest from 0.5 for the 
unmatched token.  (A sketch of how I picture this lookup sequence follows.)
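
To make my interpretation concrete, here is a minimal C sketch of the 
lookup sequence as I picture it.  This is NOT bogofilter code: 
wordlist_lookup(), MAX_TOK, and the exact normalization rules are 
hypothetical stand-ins; only the ordering of the 18 candidates and the 
"farthest from 0.5" selection are what I'm trying to illustrate.

#include <ctype.h>
#include <math.h>
#include <stdio.h>
#include <string.h>

#define MAX_TOK 256

/* Hypothetical wordlist interface: returns 1 and fills *prob if the token
 * is already in the wordlist, 0 otherwise. */
extern int wordlist_lookup(const char *token, double *prob);

/* Three case forms: as-is, first letter only capitalized, all lower case. */
static void case_forms(const char *body, char forms[3][MAX_TOK])
{
    size_t i;

    strncpy(forms[0], body, MAX_TOK - 1);
    forms[0][MAX_TOK - 1] = '\0';
    strcpy(forms[1], forms[0]);
    if (forms[1][0] != '\0')
        for (i = 1; forms[1][i] != '\0'; i++)
            forms[1][i] = (char)tolower((unsigned char)forms[1][i]);
    strcpy(forms[2], forms[1]);
    forms[2][0] = (char)tolower((unsigned char)forms[2][0]);
}

/* Three punctuation forms: as-is, trailing "!!!..." reduced to a single "!",
 * and trailing "!" removed entirely. */
static void bang_forms(const char *body, char forms[3][MAX_TOK])
{
    size_t len;

    strncpy(forms[0], body, MAX_TOK - 1);
    forms[0][MAX_TOK - 1] = '\0';
    strcpy(forms[1], forms[0]);
    len = strlen(forms[1]);
    while (len > 1 && forms[1][len - 1] == '!' && forms[1][len - 2] == '!')
        forms[1][--len] = '\0';
    strcpy(forms[2], forms[1]);
    len = strlen(forms[2]);
    if (len > 0 && forms[2][len - 1] == '!')
        forms[2][len - 1] = '\0';
}

/* Degenerated lookup: try up to 18 forms of "header*word" and keep the
 * probability farthest from 0.5.  When the word has no capitals or no
 * exclamation points some forms repeat; a real implementation would skip
 * the duplicates rather than look them up twice. */
double degenerate_lookup(const char *header, const char *word)
{
    char bang[3][MAX_TOK], cased[3][MAX_TOK], candidate[2 * MAX_TOK];
    double best = 0.5, prob;
    int with_hdr, b, c;

    bang_forms(word, bang);
    for (with_hdr = (header != NULL); with_hdr >= 0; with_hdr--) {
        for (b = 0; b < 3; b++) {
            case_forms(bang[b], cased);
            for (c = 0; c < 3; c++) {
                if (with_hdr)
                    snprintf(candidate, sizeof(candidate), "%s*%s",
                             header, cased[c]);
                else
                    snprintf(candidate, sizeof(candidate), "%s", cased[c]);
                if (wordlist_lookup(candidate, &prob) &&
                    fabs(prob - 0.5) > fabs(best - 0.5))
                    best = prob;    /* keep value farthest from 0.5 */
            }
        }
    }
    return best;    /* 0.5 if nothing matched at all */
}

Called as degenerate_lookup("Subject", "FREE!!!") this generates exactly 
the 18 candidates in the grid above; with a NULL header it generates only 
the bottom 9.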

Additionally, there's footnote [7], which points out that if the 18 lookups 
(above) find more than 1 match, the right thing to do is to use 
"cumulative" statistics, and the right way to do that is with an auxiliary 
token "Anywhere*foo".

Here are some thoughts on that:

Degenerating from uppercase to any-case is not feasible, as there's no easy 
way to query BerkeleyDB to find that "FREE" should be matched by "FrEe" and 
"fREE"; an ordinary DB->get() is an exact-key lookup.
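
For reference, a wordlist probe against BerkeleyDB boils down to something 
like the sketch below (assuming a DB handle opened elsewhere; the record 
layout with the ham/spam counts isn't shown).  There is no "match any 
capitalization" flag to pass, so every spelling we want to consider costs 
its own get(), which is why the 18-lookup scheme above is what it is.

#include <string.h>
#include <db.h>

/* Exact-key probe of an open BerkeleyDB wordlist.  Returns 1 if the token
 * is present (copying its record into buf), 0 on DB_NOTFOUND or error.
 * "FrEe" and "fREE" are simply different keys as far as the database is
 * concerned. */
int wordlist_probe(DB *dbp, const char *token, void *buf, size_t buflen)
{
    DBT key, data;
    int ret;

    memset(&key, 0, sizeof(key));
    memset(&data, 0, sizeof(data));
    key.data = (void *)token;
    key.size = (u_int32_t)strlen(token);
    data.data = buf;
    data.ulen = (u_int32_t)buflen;
    data.flags = DB_DBT_USERMEM;

    ret = dbp->get(dbp, NULL, &key, &data, 0);
    return ret == 0;
}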

"Anywhere*foo" is an additional token that is updated when any form of 
"foo" is encountered.  It seems that this would save the multiple lookups 
(described above) but would significantly increase the wordlist size.
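
Here's a sketch of how training might maintain such a token.  Again this is 
hypothetical: wordlist_increment() stands in for bogofilter's real training 
update, and the exact normalization (drop the header context, fold case, 
strip trailing '!') is just my reading of "any form of foo".

#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical training-time update: add ham/spam counts to one token. */
extern void wordlist_increment(const char *token, int ham, int spam);

/* Register one parsed token, e.g. header = "Subject", word = "FREE!!!".
 * Besides the exact form, also bump the cumulative "Anywhere*" entry so
 * that scoring can fall back on it without 17 extra lookups. */
void register_token(const char *header, const char *word, int ham, int spam)
{
    char exact[512], anywhere[512];
    size_t i;

    if (header != NULL)
        snprintf(exact, sizeof(exact), "%s*%s", header, word);
    else
        snprintf(exact, sizeof(exact), "%s", word);
    wordlist_increment(exact, ham, spam);

    /* Cumulative form: no header context, case-folded, trailing '!' gone. */
    snprintf(anywhere, sizeof(anywhere), "Anywhere*%s", word);
    for (i = strlen("Anywhere*"); anywhere[i] != '\0'; i++)
        anywhere[i] = (char)tolower((unsigned char)anywhere[i]);
    i = strlen(anywhere);
    while (i > 0 && anywhere[i - 1] == '!')
        anywhere[--i] = '\0';
    wordlist_increment(anywhere, ham, spam);
}

Every distinct word then carries at least one extra database entry, which 
is where the wordlist growth would come from.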




