Crm114 style context matching. Phrases and partial phrases.

Peter Bishop pgb at adelard.com
Sat May 17 21:19:15 CEST 2003


On 17 May 2003 at 7:22, Jef Poskanzer wrote:

> Neato.  For N=2 the number of tokens only doubles, and I bet the
> sensitivity would still be significantly better than N-1.

It's hard to say what the increase will be.
The worst case is N * N word pairs, 
where N here is the number of unique words (not the phrase length).

In practice it could be much less, as some sequences are ungrammatical
and so will rarely, if ever, turn up.

e.g. take the words:

cat sat on the mat

There are 5 unique words, but you are unlikely to see all 25 word pairs.

For example, you would not expect to see:

the sat
on sat
mat the
sat the
mat sat
the on

etc.
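
To make the comparison concrete, here is a rough Python sketch that counts
the pairs actually seen in a piece of text against the N * N worst case.
The function name bigram_stats and the sample sentence are my own
illustration, not anything taken from bogofilter or CRM114, and the
tokenising is just a lowercase whitespace split.

```python
def bigram_stats(text):
    """Return (unique words, unique adjacent pairs seen, worst-case pairs).

    Hypothetical helper for illustration only: tokens are produced by a
    plain lowercase whitespace split, which is an assumption, not how any
    real filter tokenises mail.
    """
    words = text.lower().split()
    vocab = set(words)
    # Pair each word with its successor; the set keeps distinct pairs only.
    pairs = set(zip(words, words[1:]))
    return len(vocab), len(pairs), len(vocab) ** 2


n_words, n_pairs, worst_case = bigram_stats("the cat sat on the mat")
print(n_words, n_pairs, worst_case)  # prints: 5 5 25
```

One short sentence says little about real mail, but the pattern is the
same: a text of T words can contain at most T - 1 distinct adjacent pairs,
so the observed count should stay well under N * N unless the corpus is
very large.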



-- 
Peter Bishop 
pgb at adelard.com
pgb at csr.city.ac.uk





