token degeneration
Greg Louis
glouis at dynamicro.on.ca
Sat May 31 01:15:39 CEST 2003
On 20030530 (Fri) at 1811:20 -0400, David Relson wrote:
> 1 - Is token degeneration a tool to turn on and leave on? Or is it a
> technique to transition from ignore case to mixed case? I've been thinking
> of it as a transitional tool, but I suspect it's meant as turn on and leave
> on.
One needs to be careful to turn it _off_ for training unless the any-case
degeneration is implemented; if it is, training has to update both the
case-sensitive token and the case-folded (any-case) token. See below.
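To make the "train both" point concrete, here is a minimal sketch, not
bogofilter code. The in-memory wordlist, the `train` function and the
"anycase*" key prefix are all made up for illustration; the only idea taken
from the discussion is that each training token bumps two entries, the
as-seen spelling and the folded one.

```python
from collections import defaultdict

def new_wordlist():
    # Hypothetical wordlist: key -> [spam_count, ham_count]
    return defaultdict(lambda: [0, 0])

def folded_key(token):
    # Assumed naming convention for the any-case entry (not from bogofilter)
    return "anycase*" + token.lower()

def train(wordlist, tokens, is_spam):
    """Count every token twice: once as seen, once case-folded."""
    column = 0 if is_spam else 1
    for token in tokens:
        wordlist[token][column] += 1
        wordlist[folded_key(token)][column] += 1

wl = new_wordlist()
train(wl, ["FREE", "Free"], is_spam=True)
train(wl, ["free"], is_spam=False)
```

After those two calls the folded entry carries the combined counts of all
three spellings, while each exact spelling keeps its own.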
> 2 - Two matching techniques are described - (a) search for a variety of
> matches and use the one farthest from 0.5; and (b) create and maintain an
> additional "Anywhere*foo" token with cumulative ham/spam statistics that is
> updated when any form of "foo" is encountered. Which is preferred?
With (a) one would search for all matches, case-insensitively, and
compare the individual scores. That would be much more efficient than
permuting the token and doing individual searches, though it might
require a biggish change in internal representation in the wordlists.
As for "which is preferred?", that's the wrong question: the question
is, which (if either) contributes enough to be worth its overhead?
(Admittedly, people with different email volumes may have differing
views about "worth".)
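A rough sketch of (a), under stated assumptions: `scores` is a plain dict
from token to spam probability, and the function names are hypothetical. The
linear scan stands in for whatever case-insensitive search the wordlist
would need to support, and `case_permutation_count` only illustrates how
many individual lookups the permute-and-search alternative would cost.

```python
def case_permutation_count(token):
    """Number of upper/lower spellings: two choices per cased character."""
    return 2 ** sum(1 for ch in token if ch.lower() != ch.upper())

def most_extreme_match(token, scores, unknown=0.5):
    """Return the score farthest from 0.5 among all case-insensitive matches."""
    folded = token.lower()
    # One pass over the wordlist; a real store would need an index for this.
    matches = [p for t, p in scores.items() if t.lower() == folded]
    if not matches:
        return unknown  # assumed neutral score for unseen tokens
    return max(matches, key=lambda p: abs(p - 0.5))

scores = {"FREE": 0.99, "Free": 0.7, "free": 0.4}
best = most_extreme_match("fREE", scores)   # 0.99
lookups = case_permutation_count("free")    # 16
```

Even a four-letter token has 16 case permutations, which is why a single
case-insensitive search looks cheaper than permuting.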
> Degenerating from uppercase to any-case is not feasible as there's no easy
> way to query BerkeleyDB to find that "FREE" is matched by "FrEe" and "fREE".
Store the token both as is and case-folded, with links?
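One way to picture "with links", purely as a sketch: the class and method
names are hypothetical, and two Python dicts stand in for what would be
BerkeleyDB records. The folded entry keeps the set of spellings actually
seen, so a query for "FREE" can reach "FrEe" and "fREE" with one extra
lookup.

```python
class LinkedWordlist:
    """Toy wordlist: exact-case counts plus folded -> spellings links."""

    def __init__(self):
        self.counts = {}  # exact token -> [spam_count, ham_count]
        self.links = {}   # folded token -> set of exact spellings seen

    def add(self, token, spam=0, ham=0):
        entry = self.counts.setdefault(token, [0, 0])
        entry[0] += spam
        entry[1] += ham
        # Record the link from the folded form back to this spelling.
        self.links.setdefault(token.lower(), set()).add(token)

    def variants(self, token):
        """All stored spellings that match token case-insensitively."""
        return sorted(self.links.get(token.lower(), ()))

wl = LinkedWordlist()
wl.add("FrEe", spam=3)
wl.add("fREE", spam=1, ham=1)
found = wl.variants("FREE")  # ['FrEe', 'fREE']
```

The cost is one more record per distinct folded token, plus the link list
itself.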
> "Anywhere*foo" is an additional token that is updated when any form of
> "foo" is encountered. It seems that this would save the multiple lookups
> (described above)
Not if I read Paul correctly; it would be a last-resort thing. He
wants to "keep track of statistics for ``foo'' overall as well as
specific versions, and degenerate from ``Subject*foo'' not to ``foo'' but
to ``Anywhere*foo''." You still have to go through the specific
versions.
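As a sketch of that ordering (my reading, not Paul's code): the specific
versions are tried first and "Anywhere*foo" only at the end. The function
names, the particular case forms tried, the first-hit-wins rule and the use
of the folded word in the "Anywhere*" key are all assumptions made for
illustration.

```python
def degenerate(context, word):
    """Yield lookup keys from most specific to the Anywhere* last resort."""
    seen = set()
    for form in (word, word.capitalize(), word.lower()):
        key = f"{context}*{form}"
        if key not in seen:  # skip duplicates, e.g. when word is already lower
            seen.add(key)
            yield key
    yield f"Anywhere*{word.lower()}"

def lookup(context, word, scores, unknown=0.5):
    """Return (key, score) for the first key present in scores."""
    for key in degenerate(context, word):
        if key in scores:
            return key, scores[key]
    return None, unknown  # assumed neutral score when nothing matches

scores = {"Subject*free": 0.9, "Anywhere*free": 0.6}
hit = lookup("Subject", "FREE", scores)   # ('Subject*free', 0.9)
fallback = lookup("Body", "FREE", scores) # ('Anywhere*free', 0.6)
```

Note that the "Subject" lookup still makes three probes before it finds a
hit; the "Anywhere*" entry does not remove them.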
> and would significantly increase wordlist size.
I think the specific versions would have a greater impact, but I also
think that the impact might prove tolerable. That, however, needs to
be a matter for experimentation, not speculation.
--
| G r e g L o u i s | gpg public key: finger |
| http://www.bgl.nu/~glouis | glouis at consultronics.com |
| http://wecanstopspam.org in signatures fights junk email |