Use root words to reduce training time

Kevin O'Connor kevin at koconnor.net
Mon May 17 04:26:08 CEST 2004


Hi,

Right now bogofilter stores tokens that are case-sensitive, header-sensitive,
and have punctuation marks embedded in them.  This is a great way to make
bogofilter even more accurate on well-trained installations, but it makes
training take longer for new installations.  As an example, when a new user
trains, the token "spamword" may be recognized as spam, but there is a good
chance "Spamword" or "spamword!" or "subj:Spamword" or even "SPAMWORD" won't
be.  When these new tokens are parsed they will receive the robx value, and
won't be an indicator of spam until the user does enough training that all
the common permutations of "spamword" are registered.

I'd like to try having bogofilter store a set of root words along with its
current list of tokens.  Thus, when the parser found "SpAmWoRd" for the
first time, it would look up "root:spamword" and use that instead of just
the robx value.  To start with, I'd define the root of a word as just the
case-insensitive token with punctuation and header stuff stripped out.
(One could imagine more advanced algorithms, but I'll leave that for
later.)

Keeping this extra info wouldn't be free - the database would get larger,
and all token updates would also need to update their "root:" equivalent.
On the upside, however, one could probably teach bogoutil to strip out all
the wacky permutations of words whose probabilities don't significantly
deviate from their root's.

I'd go work on a patch myself, but I'm not too clear on how to add the root
information to the current statistical code.  Can anyone suggest a good way
of adding this new information to the current algorithm?

Comments?
-Kevin


