token pairs [was: Algorithm limitations]
David Relson
relson at osagesoftware.com
Wed Apr 14 02:52:46 CEST 2004
On Tue, 13 Apr 2004 14:12:21 +0200
Boris 'pi' Piwinger wrote:
> David Relson wrote:
>
> > I'm not willing to include word pairs until after the 1.0 release,
> > but am willing to let users experiment with the technique. Attached
> > is a patch from a couple of months ago and updated to work with
> > 0.17.5. Below is a sample of the output using it:
> >
> > [relson at osage src]$ echo this is a test of word pairs | bogofilter
> > -C -H-vvv
>
> > [relson at osage src]$ echo this is a test of word pairs | bogofilter
> > -C -H-vvv -P
>
> From that I understand that you need to call -P to make use
> of the feature. Could you or someone else please give a
> brief explanation which pairs are chosen? Is it only
> adjacent tokens (in your example the short words are not
> tokens) or can you jump over a word?
Scanning the message happens as normal. Each token seen is returned for
scoring. Additionally, a token pair is created from each token and its
predecessor (with a colon separating them). When bogofilter's parsing
switches to a new header prefix, or changes between header and body
modes, no token pair is created across that boundary.
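
For anyone who wants to picture the pairing rule, here is a rough sketch in
Python. It is not the patch's code (bogofilter itself is written in C). The
(section, token) input, the section labels "subj" and "body", and the function
name are all made up for illustration, and putting the predecessor before the
colon is an assumption about the order.

```python
from typing import Iterable, Iterator, Optional, Tuple


def tokens_with_pairs(stream: Iterable[Tuple[str, str]]) -> Iterator[str]:
    """Yield every token, plus a 'previous:current' pair for adjacent tokens.

    `stream` holds (section, token) tuples. `section` is a hypothetical label
    (for example "subj" or "body") standing in for the parser's current
    header prefix or header/body mode.
    """
    prev_section: Optional[str] = None
    prev_token: Optional[str] = None
    for section, token in stream:
        # Single tokens are always returned for scoring.
        yield token
        # A pair is formed only when the predecessor came from the same section.
        if prev_token is not None and section == prev_section:
            # Assumed order: predecessor first, then a colon, then the token.
            yield f"{prev_token}:{token}"
        prev_section, prev_token = section, token


sample = [("subj", "word"), ("subj", "pairs"), ("body", "this"), ("body", "test")]
print(list(tokens_with_pairs(sample)))
# prints ['word', 'pairs', 'word:pairs', 'this', 'test', 'this:test']
```

Note that no "pairs:this" pair appears, because the section changes between
those two tokens.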