CRM114-style context matching. Phrases and partial phrases.

Sun May 18 13:12:18 CEST 2003

On 20030517 (Sat) at 1839:11 -0400, Greg Louis wrote:

> Right now my priority is to discover whether there is or is not a major
> improvement in discrimination with the introduction of phrases.  If
> there be such, then it's worth learning how to obtain it at the least
> cost.  If not, then we needn't bother optimizing.  I'm therefore not
> training on error, but just building a training db with 11,000 spams
> and 11,000 nonspams, not worrying about efficiency at this stage.  I
> expect to finish training and start on testing within the next 60
> minutes; I am hoping the testing will go much faster than the training
> did.
> 
The first experiment failed, at least in part because of human error.
A quick check shows that the method used to build the training
databases was flawed: .MSG-COUNT came out zero (I should have realized
that would happen).  I will need to dump the (huge) training db and set
.MSG-COUNT manually, then reload (should cut the size a bit too).
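
The dump / fix / reload step could be scripted roughly as follows.  This
is only a sketch: it assumes the dump is plain text with one
whitespace-separated "token count" record per line (an assumption about
the dump format, not something checked here), and set_msg_count is a
hypothetical helper name, not part of any existing tool.

```python
def set_msg_count(dump_lines, msg_count, key=".MSG-COUNT"):
    """Return the dump lines with the message-count record fixed.

    Assumes each line is "token count [extra fields]"; this layout is an
    assumption made for illustration.
    """
    out = []
    found = False
    for line in dump_lines:
        fields = line.split()
        if fields and fields[0] == key:
            # Replace the count, keep any trailing fields untouched.
            fields = [key, str(msg_count)] + fields[2:]
            out.append(" ".join(fields))
            found = True
        else:
            out.append(line)
    if not found:
        # No record at all: add one so the reload sees a count.
        out.append(f"{key} {msg_count}")
    return out


fixed = set_msg_count(["foo 3", ".MSG-COUNT 0", "bar 1"], 11000)
```

The corrected lines would then be written back out and reloaded with
whatever utility produced the dump.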

Database size is a _major_ potential problem.  With PIPE_SIZE 4, the
spamlist grows by a factor of about 25 and the goodlist by about 10 in
comparison to the lists built with just single tokens.  Total size,
with 11,000 spams and 11,000 nonspams, is about 3/4 GB.
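
To give a feel for why the lists grow so much, here is a rough sketch of
phrase and partial-phrase generation over a sliding window.  It assumes
a scheme in which the current token is always kept and each earlier
token in the window is either kept or replaced by a placeholder; the
function name, the "<skip>" marker and the exact scheme are illustrative
assumptions, not the code used in the experiment.

```python
from itertools import product

PIPE_SIZE = 4  # window length, as in the post; the rest is illustrative


def phrase_features(tokens, pipe_size=PIPE_SIZE, skip="<skip>"):
    """List the single token plus every phrase and partial phrase
    ending at each position of the token stream."""
    features = []
    for i, current in enumerate(tokens):
        history = tokens[max(0, i - pipe_size + 1):i]
        # Each earlier token in the window is kept or skipped.
        for mask in product((False, True), repeat=len(history)):
            parts = [tok if keep else skip
                     for tok, keep in zip(history, mask)]
            # Drop leading placeholders so "<skip> b" collapses to "b".
            while parts and parts[0] == skip:
                parts.pop(0)
            features.append(" ".join(parts + [current]))
    return features


feats = phrase_features(["a", "b", "c", "d"])
```

With a full window of 4, each incoming token yields up to 2**3 = 8
features instead of 1, and most of the multi-token ones are rare, which
is the kind of growth in distinct entries described above.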

I'll report back when the training db has been fixed and the tests
rerun.

-- 
| G r e g  L o u i s          | gpg public key: finger     |
|   http://www.bgl.nu/~glouis |   glouis at consultronics.com |
| http://wecanstopspam.org in signatures fights junk email |