token pairs [was: Algorithm limitations]

Wed Apr 14 13:39:51 CEST 2004

On Wed, 14 Apr 2004 08:47:19 +0200
Boris 'pi' Piwinger wrote:

> David Relson <relson at osagesoftware.com> wrote:
> 
> >Scanning the message happens as normal.  Each token seen is returned
> >for scoring.  Additionally, a token pair is created using each token
> >and its predecessor (with a colon separating them).
> 
> There is the risk that this creates a token pair which looks
> like a tagged entry, since both use a colon.
> 
> pi

True.  The odds are low, but it can happen.  One could use a leading
colon or a pair of colons, as in ":token:pair" and "token::pair", to
avoid the problem.  As there are only a few tags in use, I think the
risk of a collision is pretty low.  Also, since most tokens have fairly
neutral scores, the chance of getting an incorrect result because of a
collision is even smaller.
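
To make the collision concrete, here is a minimal sketch in Python.  It
is an illustration only, not the actual implementation: the function
name with_pairs, its sep and lead parameters, the order in which pairs
are emitted, and the "subj" tag are all assumptions made for the example.

```python
from typing import Iterable, Iterator


def with_pairs(tokens: Iterable[str], sep: str = ":", lead: str = "") -> Iterator[str]:
    """Yield every token, plus a pair built from each token and its predecessor."""
    prev = None
    for tok in tokens:
        yield tok
        if prev is not None:
            # Pair = optional leading marker + predecessor + separator + token.
            yield f"{lead}{prev}{sep}{tok}"
        prev = tok


# Hypothetical tagged entry: the word "hello" seen in a Subject header.
tagged = "subj:hello"

# Body text that happens to contain the word "subj" followed by "hello".
body = ["subj", "hello"]

single = set(with_pairs(body))               # contains "subj:hello"
double = set(with_pairs(body, sep="::"))     # contains "subj::hello"
leading = set(with_pairs(body, lead=":"))    # contains ":subj:hello"

print(tagged in single)    # True: the pair collides with the tagged entry
print(tagged in double)    # False
print(tagged in leading)   # False
```

With a single colon, the pair is indistinguishable from the tagged
entry.  Either alternative keeps the two apart, provided ordinary tokens
never begin or end with a colon themselves.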