Crm114 style context matching. Phrases and partial phrases.
Peter Bishop
pgb at adelard.com
Sat May 17 21:19:15 CEST 2003
On 17 May 2003 at 7:22, Jef Poskanzer wrote:
> Neato. For N=2 the number of tokens only doubles, and I bet the
> sensitivity would still be significantly better than N-1.
It's hard to say what the increase will be.
The worst case for word pairs is N * N tokens,
where N here is the number of unique words (not the phrase length).
In practice it could be much less, as many sequences are ungrammatical
and so never occur. For example, take the words:
cat sat on the mat
There are 5 unique words, but you are unlikely to see all 25 word pairs.
For example, you would not expect to see:
the sat
on sat
mat the
sat the
mat sat
the on
etc.
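
To make the gap between the worst case and what actually turns up concrete, here is a minimal Python sketch that counts the distinct adjacent word pairs in a piece of text and compares that with N * N. The sample sentence "the cat sat on the mat" and the helper name adjacent_pairs are my own assumptions for illustration; this is not how bogofilter or CRM114 tokenizes.

```python
def adjacent_pairs(text):
    """Return the set of distinct adjacent word pairs in text (hypothetical helper)."""
    words = text.lower().split()
    # Pair each word with the word that follows it.
    return set(zip(words, words[1:]))


# Assumed sample text built from the five words in the example above.
sample = "the cat sat on the mat"

vocab = set(sample.split())        # unique words
observed = adjacent_pairs(sample)  # pairs that actually occur
worst_case = len(vocab) ** 2       # N * N possible ordered pairs

print(len(vocab), len(observed), worst_case)  # 5 5 25
```

For this one sentence only 5 of the 25 possible pairs occur. A real corpus would fill in more of them, but pairs such as "the sat" should stay rare or absent.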
--
Peter Bishop
pgb at adelard.com
pgb at csr.city.ac.uk