Use root words to reduce training time

Tue May 18 16:02:27 CEST 2004

On Tue, May 18, 2004 at 07:09:15AM -0400, David Relson wrote:
> On Tue, 18 May 2004 01:53:55 -0400
> Kevin O'Connor wrote:
> > The advantage of your way is that it is easier to implement and fits
> > in well with the rest of the code.  A possible disadvantage, however,
> > is that it could cause root tokens to overly influence the outcome.
> 
> Implementation and fit are commonly two of my goals :-)  "overly
> influence" indicates I don't understand your idea.  Can you explain more
> fully?

Right now, if the code finds a new token (like "subj:Foo!"), the token is
given the robX value for scoring purposes.  That is the 'x' in the
following formula:

n = badcount + goodcount
f(w) = (s * x + n * p(w)) / (s + n)
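To make the n==0 case concrete, here is a rough sketch of that formula in
Python (not bogofilter's actual C code).  The function name and the default
values s=1.0 and x=0.5 are made up for illustration and are not bogofilter's
configured defaults.

```python
def robinson_f(badcount, goodcount, p_w, s=1.0, x=0.5):
    """f(w) = (s * x + n * p(w)) / (s + n), as in the formula above.

    s and x are illustrative placeholders, not bogofilter's real defaults.
    """
    n = badcount + goodcount
    return (s * x + n * p_w) / (s + n)


# A token never seen before has n == 0, so p(w) drops out and f(w) == x.
unseen = robinson_f(0, 0, p_w=0.99)

# A token seen 9 times, always in spam, is pulled most of the way to p(w).
seen = robinson_f(9, 0, p_w=1.0)
```

With n == 0 the p(w) term vanishes entirely, which is why a brand new token
always scores exactly x no matter what is known about related tokens.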

Since the token ("subj:Foo!") has never been seen before (n==0), bogofilter
just gives up and uses a default value.  However, on a closer look, one
might find that "foo" (in all its other permutations) is a large ham or
spam indicator.  In that case, using a neutral default is suboptimal.  I
was thinking of something kinda like:

n = badcount + goodcount
f(w) = (s * f(root(w)) + n * p(w)) / (s + n)

In high-level terms, the above would basically say: "Score as is currently
done for all well-known tokens, but for unknown and lesser-known tokens
weight the probability towards that of the underlying root word."  The idea
is, if it is known that "spamword" (in its many case, header, and
punctuation permutations) is a high indicator of spam, then when
"subj:SpamWord!" is seen for the first time, bogofilter can score it as
spam without needing to train on that particular permutation.  On the other
hand, if the token "FOO" is common and is a high indicator of spam, then
the calculations should use that and not be affected if "foo" in other
permutations is neutral.
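
A rough Python sketch of the root-weighted formula, under some loud
assumptions: p(w) is simplified to badcount / n (ignoring the per-corpus
message counts), the root token itself is scored with the usual x as its
prior, and s=1.0, x=0.5 are illustrative values only.  All names here are
hypothetical, not bogofilter functions.

```python
def p(bad, good):
    """Simplified p(w): fraction of sightings that were in spam (assumption)."""
    n = bad + good
    return bad / n if n else 0.0


def f(bad, good, prior, s=1.0):
    """(s * prior + n * p(w)) / (s + n), with a caller-supplied prior."""
    n = bad + good
    return (s * prior + n * p(bad, good)) / (s + n)


def score_with_root(token, root, s=1.0, x=0.5):
    """token and root are (badcount, goodcount) pairs (hypothetical layout)."""
    root_score = f(*root, prior=x, s=s)       # f(root(w)), using x as before
    return f(*token, prior=root_score, s=s)   # f(w), using f(root(w)) for x


# "subj:SpamWord!" never seen, but the root is very spammy:
unseen = score_with_root(token=(0, 0), root=(99, 1))

# "FOO" well known and spammy, while the root is neutral overall:
known = score_with_root(token=(50, 0), root=(50, 50))
```

In the first case the score equals the root's score, since n == 0.  In the
second case the token's own counts dominate and the neutral root barely
moves the result.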

If I understand you correctly, you were suggesting a change to get_token
that would add both the existing and root tokens to the scoring list.  This
change would be easier, but it would give as much weight to the root token
as it does to the real token.  I had not thought of doing it this way, and
it may work out just as well.  However, my one concern is that the root
tokens would overly influence the overall outcome.  As an example, consider
a user named Fred - his token list might have a "root:fred" token that is
quite hammy because his name (in various permutations) is common in ham
emails.  However, "subj:Fred!" might be a high spam indicator.  In that
case the get_token change might return conflicting indicators (a hammy
"root:fred" and a spammy "subj:Fred!").  In this case I think the ideal
thing would be to ignore "root:fred" and just use "subj:Fred!".  Of course,
in practice this may not happen frequently enough to justify a change to
the statistics code.
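
To put rough numbers on the Fred example, here is a small Python sketch
comparing the two approaches.  The counts are invented, p(w) is simplified
to badcount / n, and s=1.0, x=0.5 are placeholder values, so treat this as
an illustration of the concern rather than a measurement.

```python
def f(bad, good, prior, s=1.0):
    """(s * prior + n * p(w)) / (s + n), with p(w) simplified to bad / n."""
    n = bad + good
    p = bad / n if n else 0.0
    return (s * prior + n * p) / (s + n)


X = 0.5                 # placeholder robX value
root_fred = (5, 95)     # invented counts: "root:fred" is mostly ham
subj_fred = (20, 0)     # invented counts: "subj:Fred!" is always spam

# get_token approach: both tokens land in the scoring list with equal weight.
both_tokens = [f(*root_fred, prior=X), f(*subj_fred, prior=X)]

# Formula approach: only "subj:Fred!" is scored, with the root as its prior.
root_as_prior = f(*subj_fred, prior=f(*root_fred, prior=X))
```

Here both_tokens holds one strongly hammy and one strongly spammy score,
while root_as_prior is a single spammy score because the 20 sightings of
"subj:Fred!" outweigh the root.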

I hope this better clarifies what I am thinking about.
-Kevin