Use root words to reduce training time

Mon May 17 14:44:42 CEST 2004

On Sun, 16 May 2004 22:26:08 -0400
Kevin O'Connor wrote:

...[snip]...

> Keeping this extra info wouldn't be free - the database would get
> larger, and all token updates would need to also update their "root:"
> equivalent. On the upside, however, one could probably teach bogoutil
> to strip out all the wacky permutations of words that don't have
> probabilities that significantly deviate from their root.

Hi Kevin,

You've got some interesting ideas.  'Tis hard to say whether they're
right or not, but they could be right.  Implementing them and testing
the results would show whether they're useful.

> I'd go work on a patch myself, but I'm not too clear on how to add the
> root information to the current statistical code.  Can anyone suggest
> a good way of adding this new information to the current algorithm?

In token.c there's a function called get_token().  Modifying that
function to return "token" and "root:token" shouldn't be too difficult.

The function has a big switch statement, and the code for IPADDR might
be closest to what you want to do.  Given an IP address, it returns
several tokens for scoring -- which is basically what you're thinking
about.  All tokens returned by get_token() will end up in the wordlist
when you register the message.
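
To make the "token" plus "root:token" idea concrete, here is a rough,
standalone sketch.  It is not bogofilter code and does not touch the real
get_token().  The names make_root() and emit_tokens() are hypothetical,
and the root rule (lowercase, drop anything that isn't a letter or digit)
is only an assumption about what a "root" might be.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define ROOT_PREFIX "root:"
#define TOKEN_MAX   64

/* Hypothetical root rule (an assumption, not bogofilter's): lowercase the
 * token and drop every character that is not a letter or digit.
 * Writes "root:<root>" into out and returns its length. */
size_t make_root(const char *token, char *out, size_t outsz)
{
    size_t n = strlen(ROOT_PREFIX);

    assert(outsz > n);                  /* room for the prefix plus NUL */
    memcpy(out, ROOT_PREFIX, n);
    for (; *token != '\0' && n + 1 < outsz; token++) {
        unsigned char c = (unsigned char)*token;
        if (isalnum(c))
            out[n++] = (char)tolower(c);
    }
    out[n] = '\0';
    return n;
}

/* Emit the literal token first and its root: variant second, in the
 * spirit of a lexer case that yields several tokens for one input word.
 * Returns the number of tokens written to out. */
int emit_tokens(const char *token, char out[2][TOKEN_MAX])
{
    strncpy(out[0], token, TOKEN_MAX - 1);
    out[0][TOKEN_MAX - 1] = '\0';
    make_root(token, out[1], TOKEN_MAX);
    return 2;
}
```

With this rule, emit_tokens("F.R.E.E!", out) fills out[0] with "F.R.E.E!"
and out[1] with "root:free".  Both would then be registered and scored
like any other token.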

As to removing wacky tokens, that is also doable.  Look at bogoutil's
maintenance functions and the "-p" option.  You can use the maint.c code
to loop through the wordlist and the probability code to find neutral
tokens.
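
The pruning pass could look roughly like the sketch below.  It is a
standalone illustration, not the maint.c code: struct entry,
spamicity(), is_redundant() and prune_variants() are hypothetical names,
and the score is a plain spam/(spam+good) ratio standing in for
bogofilter's real probability calculation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical wordlist record: a token and its message counts. */
struct entry {
    const char *word;
    unsigned spam;      /* times seen in spam */
    unsigned good;      /* times seen in non-spam */
};

/* Simplified score (an assumption): share of occurrences that were
 * spam, or a neutral 0.5 for a token that was never seen. */
double spamicity(unsigned spam, unsigned good)
{
    unsigned total = spam + good;
    return total == 0 ? 0.5 : (double)spam / (double)total;
}

/* A variant is redundant when its score differs from the root's score
 * by less than min_dev. */
int is_redundant(const struct entry *variant, const struct entry *root,
                 double min_dev)
{
    double d = spamicity(variant->spam, variant->good)
             - spamicity(root->spam, root->good);

    assert(min_dev >= 0.0);
    if (d < 0.0)
        d = -d;
    return d < min_dev;
}

/* Compact the list in place, keeping only the variants that deviate
 * from the root by at least min_dev.  Returns the new length. */
size_t prune_variants(struct entry *list, size_t n,
                      const struct entry *root, double min_dev)
{
    size_t kept = 0;

    for (size_t i = 0; i < n; i++)
        if (!is_redundant(&list[i], root, min_dev))
            list[kept++] = list[i];
    return kept;
}
```

The threshold min_dev is the knob to experiment with: too small and
nothing gets pruned, too large and variants that carry real information
are thrown away.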

Enjoy!

David