token pairs [was: Algorithm limitations]

Wed Apr 14 02:52:46 CEST 2004

On Tue, 13 Apr 2004 14:12:21 +0200
Boris 'pi' Piwinger wrote:

> David Relson wrote:
> 
> > I'm not willing to include word pairs until after the 1.0 release,
> > but am willing to let users experiment with the technique.  Attached
> > is a patch from a couple of months ago and updated to work with
> > 0.17.5. Below is a sample of the output using it:
> > 
> > [relson@osage src]$ echo this is a test of word pairs | bogofilter -C -H-vvv
> 
> > [relson@osage src]$ echo this is a test of word pairs | bogofilter -C -H-vvv -P
> 
> From that I understand that you need to call -P to make use
> of the feature. Could you or someone else please give a
> brief explanation of which pairs are chosen? Is it only
> adjacent tokens (in your example the short words are not
> tokens) or can you jump over a word? 

The message is scanned as usual, and each token seen is returned for
scoring.  In addition, a token pair is created from each token and its
predecessor, joined with a colon.  No pair is created across a boundary
where bogofilter's parser switches to a new header prefix or switches
between header and body mode.
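
As a rough illustration of the rule above, here is a minimal Python
sketch.  It is not bogofilter's code (which is C); the function name,
the (section, text) input format and the section labels are all made up
for the example, and the order in which singles and pairs are emitted is
an assumption.

```python
def with_token_pairs(tokens):
    """Return every single token plus 'prev:cur' pairs for neighbours.

    `tokens` is a sequence of (section, text) tuples.  `section` is a
    hypothetical label standing in for the parser state (a header
    prefix, or "body"); it is an assumption of this sketch, not a
    bogofilter data structure.
    """
    out = []
    prev_section = None
    prev_text = None
    for section, text in tokens:
        # Each token is always returned on its own.
        out.append(text)
        # A pair is only formed with a predecessor from the same section,
        # so no pair crosses a header-prefix or header/body boundary.
        if prev_text is not None and section == prev_section:
            out.append(f"{prev_text}:{text}")
        prev_section, prev_text = section, text
    return out


sample = [
    ("subject", "test"),
    ("subject", "word"),
    ("body", "word"),
    ("body", "pairs"),
]
print(with_token_pairs(sample))
# ['test', 'word', 'test:word', 'word', 'pairs', 'word:pairs']
```

Note that no "word:word" pair appears where the sample switches from
the subject to the body.  The predecessor here is simply the previous
token the parser returned, so text the tokenizer discards never shows up
in a pair.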