token degeneration

Greg Louis glouis at dynamicro.on.ca
Sat May 31 01:15:39 CEST 2003


On 20030530 (Fri) at 18:11:20 -0400, David Relson wrote:

> 1 - Is token degeneration a tool to turn on and leave on?  Or is it a 
> technique to transition from ignore case to mixed case?  I've been thinking 
> of it as a transitional tool, but I suspect it's meant as turn on and leave 
> on.

One needs to be careful to turn it _off_ for training unless the
any-case degeneration is implemented; if it is, one must train both
the token as seen (case sensitive) and the token in any-case (aka
folded) form.  See below.

> 2 - Two matching techniques are described - (a) search for a variety of 
> matches and use the one farthest from 0.5; and (b) create and maintain an 
> additional "Anywhere*foo" token with cumulative ham/spam statistics that is 
> updated when any form of "foo" is encountered.  Which is preferred?

With (a) one would search for all matches, case-insensitively, and
compare the individual scores.  That would be much more efficient than
permuting the token and doing individual searches, though it might
require a biggish change in internal representation in the wordlists.
As for "which is preferred?", that's the wrong question: the question
is, which (if either) contributes enough to be worth its overhead? 
(Admittedly, people with different email volumes may have differing
views about "worth".)
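
To make (a) concrete, here is a rough Python sketch, not bogofilter
code.  It assumes a hypothetical secondary index keyed on the folded
form, so that one lookup returns every stored case variant; the
wordlist contents, the function names and the naive score are all
made up for illustration.

```python
from collections import defaultdict

# Hypothetical wordlist: token -> (spam_count, ham_count). Counts are made up.
wordlist = {
    "FREE": (90, 10),
    "Free": (30, 20),
    "free": (40, 60),
}

def build_fold_index(words):
    """Group stored tokens by folded form so one lookup finds all variants."""
    index = defaultdict(list)
    for token in words:
        index[token.lower()].append(token)
    return index

def score(spam, ham):
    """Naive spam probability; only a stand-in for the real per-token score."""
    return spam / (spam + ham)

def most_extreme_variant(token, words, index):
    """Return (variant, score) for the case variant whose score is farthest
    from the neutral 0.5, or None if no variant of the token is known."""
    variants = index.get(token.lower(), [])
    if not variants:
        return None
    scored = [(v, score(*words[v])) for v in variants]
    return max(scored, key=lambda pair: abs(pair[1] - 0.5))

fold_index = build_fold_index(wordlist)
best = most_extreme_variant("fREE", wordlist, fold_index)
```

The extra index is the "biggish change" in representation; whether
the single lookup pays for maintaining it is exactly the overhead
question above.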

> Degenerating from uppercase to any-case is not feasible as there's no easy 
> way to query BerkeleyDB to find that "FREE" is matched by "FrEe" and "fREE".

Store the token both as is and case-folded, with links?
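
Roughly what I have in mind, as a Python sketch using plain dicts in
place of the database (this is not the BerkeleyDB API, and the record
layout and function names are assumptions): the folded form is just
another key, and its record carries the links to the case variants
seen so far.

```python
# Hypothetical in-memory stand-ins for the wordlist database.
exact = {}    # token as seen   -> [spam_count, ham_count]
folded = {}   # lower-cased key -> {"counts": [spam, ham], "variants": set()}

def train(token, is_spam):
    """Update both the case-sensitive record and the folded record."""
    col = 0 if is_spam else 1
    exact.setdefault(token, [0, 0])[col] += 1
    rec = folded.setdefault(token.lower(),
                            {"counts": [0, 0], "variants": set()})
    rec["counts"][col] += 1
    rec["variants"].add(token)   # the "link" back to the specific form

def lookup(token):
    """Prefer the exact token; otherwise degenerate to the folded record."""
    if token in exact:
        return tuple(exact[token])
    rec = folded.get(token.lower())
    return tuple(rec["counts"]) if rec else None

train("FREE", True)
train("FREE", True)
train("free", False)
```

With this, "FrEe" has no exact record but still resolves through the
folded key, at the cost of one extra record per distinct folded token
plus the variant links.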

> "Anywhere*foo" is an additional token that is updated when any form of 
> "foo" is encountered.  It seems that this would save the multiple lookups 
> (described above)

Not if I read Paul correctly; it would be a last-resort thing.  He
wants to "keep track of statistics for ``foo'' overall as well as
specific versions, and degenerate from ``Subject*foo'' not to ``foo'' but
to ``Anywhere*foo''."  You still have to go through the specific
versions.
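
As I read it, the lookup order would be something like the Python
sketch below.  The key format follows the "Subject*foo" notation
quoted above; the function names and the dict-based wordlist are
hypothetical, and only the final fallback step is shown, not a full
degeneration chain.

```python
def train_with_context(wordlist, context, word, is_spam):
    """Count the word under its specific context and under Anywhere*."""
    col = 0 if is_spam else 1
    for key in (f"{context}*{word}", f"Anywhere*{word}"):
        wordlist.setdefault(key, [0, 0])[col] += 1

def lookup_with_fallback(wordlist, context, word):
    """Try the specific version first; use Anywhere*word as a last resort."""
    for key in (f"{context}*{word}", f"Anywhere*{word}"):
        if key in wordlist:
            return key, tuple(wordlist[key])
    return None

wl = {}
train_with_context(wl, "Subject", "foo", True)
train_with_context(wl, "Body", "foo", False)
```

Note that the specific lookup still happens every time; the
Anywhere* record only answers when the specific one is missing, so it
does not save the multiple lookups.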

> and would significantly increase wordlist size.

I think the specific versions would have a greater impact, but I also
think that the impact might prove tolerable.  That, however, needs to
be a matter for experimentation, not speculation.

-- 
| G r e g  L o u i s          | gpg public key: finger     |
|   http://www.bgl.nu/~glouis |   glouis at consultronics.com |
| http://wecanstopspam.org in signatures fights junk email |



