CRM114-style context matching. Phrases and partial phrases.

Sat May 17 23:38:11 CEST 2003

At 05:23 PM 5/17/03, Greg Louis wrote:

>On 20030517 (Sat) at 1637:45 -0400, David Relson wrote:
>
> > Any single message won't have W*W tokens.  The concern is more that,
> > over time, the overall corpora will use W*W tokens and give an overly
> > large wordlist.
> >
>Yup.  To give you an idea:
>-rw-r--r--    1 spamtest users    19558400 May 17 09:25 spamlist.db
>-rw-r--r--    1 spamtest users   466722816 May 17 17:17 spamlist.db
>
>The first line is the normal spamlist.db from 11,000 spams; ca. 18 MB.
>The second is from the same 11,000 spams with phrases, n=4; 445 MB.
>
>This is why CRM114 trains only on errors; we'll have to learn how to
>manage the tradeoff between manageable training db size and superior
>discrimination, if phrases turn out to make a big difference.
>
>I'm about half way through building the corresponding goodlist.db; with
>luck, I should have the test results later this evening.
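
To make the size growth concrete, here is a minimal sketch of phrase
tokenization. It assumes that "phrases, n=4" means every contiguous run of
1 to 4 words is stored as a token; that is an assumption for illustration
only, not necessarily what the patch or CRM114 itself does. The function
name and the sample message are hypothetical.

```python
def phrase_tokens(words, max_len=4):
    """Yield every contiguous phrase of 1..max_len words (assumed scheme)."""
    for start in range(len(words)):
        for length in range(1, max_len + 1):
            if start + length > len(words):
                break  # phrase would run past the end of the message
            yield " ".join(words[start:start + length])


# Hypothetical five-word message, used only to count tokens.
words = "buy cheap pills online now".split()

single_tokens = set(phrase_tokens(words, max_len=1))
phrase_set = set(phrase_tokens(words, max_len=4))

print(len(single_tokens), "single-word tokens")
print(len(phrase_set), "tokens with phrases up to n=4")
```

Each message yields only a few times more tokens than before, but the
longer phrases repeat far less often across messages than single words do,
so the wordlist keeps picking up new entries as the corpus grows.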

Seems like you already have a script for "train on error".  Have you
considered putting randomtrain to work?
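
For reference, the "train on error" idea can be sketched roughly as below.
This is a toy under stated assumptions: TinyFilter and its count-difference
score are hypothetical stand-ins, not bogofilter's or CRM114's actual
classifier, and the corpus is made up. The point is only that a message is
added to the wordlists when the current lists misclassify it, which keeps
the database small.

```python
from collections import Counter


class TinyFilter:
    """Toy classifier (hypothetical): compares per-token spam and good counts."""

    def __init__(self):
        self.spam = Counter()
        self.good = Counter()

    def train(self, tokens, is_spam):
        # Count each token once per message.
        (self.spam if is_spam else self.good).update(set(tokens))

    def is_spam(self, tokens):
        score = sum(self.spam[t] - self.good[t] for t in set(tokens))
        return score > 0  # unknown messages default to "good"


def train_on_error(filt, corpus):
    """Train only on messages the filter currently gets wrong."""
    trained = 0
    for tokens, label in corpus:
        if filt.is_spam(tokens) != label:
            filt.train(tokens, label)
            trained += 1
    return trained


# Made-up corpus of (tokens, is_spam) pairs.
corpus = [
    (["cheap", "pills"], True),
    (["meeting", "agenda"], False),
    (["cheap", "pills", "now"], True),
    (["agenda", "notes"], False),
]

filt = TinyFilter()
trained = train_on_error(filt, corpus)
print(trained, "of", len(corpus), "messages were trained on")
```

Only the misclassified messages contribute tokens, so the wordlist size
tracks the number of errors rather than the size of the corpus.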