Crm114 style context matching. Phrases and partial phrases.
David Relson
relson at osagesoftware.com
Sat May 17 22:37:45 CEST 2003
At 03:29 PM 5/17/03, Peter Bishop wrote:
>On 17 May 2003 at 20:19, bogofilter at aotto.com wrote:
>
> > On 17 May 2003 at 7:22, Jef Poskanzer wrote:
> >
> > > Neato. For N=2 the number of tokens only doubles, and I bet the
> > > sensitivity would still be significantly better than N-1.
> >
> > It's hard to say what the increase will be
> > - the worst case is W * W
> > where W is the number of unique words
> >
>
>Actually, come to think of it, it is also unlikely because the mail
>message would have to be an extremely weird one where each word appears
>many times over in all possible combinations.
Any single message won't have W*W tokens. The concern is more that,
over time, the overall corpora will use W*W tokens and give an overly large
wordlist.
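To put rough numbers on that, here is a small Python sketch. It is not bogofilter code: the function name pair_counts and the naive whitespace tokenizer are assumptions made only for illustration. A message of n words can contribute at most n-1 adjacent pairs, so one message stays far below W*W; only the accumulated wordlist can creep toward that bound.

```python
def pair_counts(text):
    """Return (unique words, unique adjacent pairs, W*W bound) for one message.

    Hypothetical helper: tokenizes on whitespace only, which is much
    cruder than a real mail lexer.
    """
    words = text.lower().split()
    # Adjacent pairs: each word joined with the word that follows it.
    pairs = set(zip(words, words[1:]))
    w = len(set(words))
    return w, len(pairs), w * w


if __name__ == "__main__":
    w, p, bound = pair_counts("buy cheap pills buy cheap watches")
    print(f"unique words={w} unique pairs={p} worst case={bound}")
```

For the sample string this reports 4 unique words and 4 unique pairs against a worst case of 16.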
By the way, I think the code for creating two word phrases is next to
trivial. My estimate for implementing a basic capability is that it'll
take 5 or 10 lines of code. I'd be inclined to keep a phrase within a
single header line or the body of the message (or individual bodies in a
multipart mime message). That would add some additional code, though not a
whole lot.
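As a rough sketch of the idea (in Python rather than bogofilter's C, and not the actual implementation), something like the following would do. The function name phrase_tokens and the "list of section strings" input are assumptions for illustration: each header line and each body part is passed as a separate string, so no phrase spans two of them.

```python
def phrase_tokens(sections):
    """Return single words plus two-word phrases for each section.

    `sections` is a hypothetical list of strings: one per header line
    and one per message body (or per body part of a multipart message).
    Phrases are built only from words inside the same section.
    """
    tokens = []
    for section in sections:
        words = section.split()
        tokens.extend(words)
        # Pair each word with its successor; pairing restarts per section.
        tokens.extend(f"{a} {b}" for a, b in zip(words, words[1:]))
    return tokens


if __name__ == "__main__":
    for tok in phrase_tokens(["Subject: free money", "click here now"]):
        print(tok)
```

Because the pairing restarts for every section, the last word of the Subject line is never joined to the first word of the body.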