Crm114 style context matching. Phrases and partial phrases.

Sat May 17 22:37:45 CEST 2003

At 03:29 PM 5/17/03, Peter Bishop wrote:

>On 17 May 2003 at 20:19, bogofilter at aotto.com wrote:
>
> > On 17 May 2003 at 7:22, Jef Poskanzer wrote:
> >
> > > Neato.  For N=2 the number of tokens only doubles, and I bet the
> > > sensitivity would still be significantly better than N-1.
> >
> > It's hard to say what the increase will be
> > - the worst case is  W * W
> > where W is the number of unique words
> >
>
>Actually, come to think of it, it is also unlikely because the mail
>message would have to be an extremely weird one where each word appears
>many times over in all possible combinations.

Any single message won't have W*W tokens.  The concern is more that,
over time, the corpus as a whole could grow toward W*W tokens and produce
an overly large wordlist.
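
To make that concrete: a message with n words has at most n - 1 adjacent
pairs, so no one message gets anywhere near W*W; only the accumulated
wordlist can.  A rough sketch in Python (the function names and the toy
messages are made up for illustration, they are not bogofilter code):

```python
def adjacent_pairs(words):
    """Return the adjacent two-word phrases of one message, in order."""
    return [f"{a} {b}" for a, b in zip(words, words[1:])]


def wordlist_growth(messages):
    """Yield (unique words, unique pairs) after each message is added."""
    seen_words, seen_pairs = set(), set()
    for text in messages:
        words = text.lower().split()
        seen_words.update(words)
        seen_pairs.update(adjacent_pairs(words))
        yield len(seen_words), len(seen_pairs)


# Toy corpus: the same four words in different orders.
messages = [
    "buy cheap pills now",
    "cheap pills buy now",
    "now buy pills cheap",
]
for w, p in wordlist_growth(messages):
    print(f"W={w}  pairs={p}  bound W*W={w * w}")
```

Each message adds at most three pairs, but the pair count keeps climbing
toward the W*W bound while W itself stays put.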

By the way, I think the code for creating two-word phrases is next to
trivial.  My estimate for implementing a basic capability is that it'll
take 5 or 10 lines of code.  I'd be inclined to keep a phrase within a
single header line or the body of the message (or individual bodies in a
multipart MIME message).  That would add some additional code, though not a
whole lot.
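
A minimal sketch of the idea, in Python rather than bogofilter's C, and
assuming the message has already been split into segments (one per header
line, one per body part).  The name phrase_tokens, the "*" joiner, and the
simplistic word pattern are all assumptions made for illustration:

```python
import re

# Simplistic word pattern; an assumption, not bogofilter's real lexer.
WORD_RE = re.compile(r"[A-Za-z0-9$'-]+")


def phrase_tokens(segments, joiner="*"):
    """Return single words plus two-word phrases; phrases never span segments."""
    tokens = []
    for segment in segments:
        prev = None  # reset so a phrase cannot cross a segment boundary
        for word in WORD_RE.findall(segment.lower()):
            tokens.append(word)
            if prev is not None:
                tokens.append(prev + joiner + word)
            prev = word
    return tokens


segments = ["Subject: free money", "Click here now"]
print(phrase_tokens(segments))
```

The inner loop is the "5 or 10 lines" part; resetting prev per segment is
the extra bit that keeps a phrase from straddling a header line and the body.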