CRM114-style context matching: phrases and partial phrases.

Greg Louis glouis at dynamicro.on.ca
Mon May 19 01:14:41 CEST 2003


On 20030518 (Sun) at 0712:18 -0400, Greg Louis wrote:

> The first experiment failed, at least in part because of a human error.
> A quick check shows that the method used to build the training
> databases was flawed: .MSG-COUNT came out zero (I should have realized
> that would happen).  I will need to dump the (huge) training db and set
> .MSG-COUNT manually, then reload (should cut the size a bit too).

Red herring, ENOTENOUGHCOFFEE.  The proper pseudo-token name is
.MSG_COUNT, with an underscore.  The reasons for failure were a bit
more complicated, but I'm happy to say I now have what I believe is a
valid comparison.

A bit of explanation of the experimentation procedure is in order
here.  One needs to adjust the spam cutoff for the various runs so
that each generates a constant number of false positives, and then
compare the numbers of false negatives.  However, it's not useful to
let the spam cutoff get too close either to 1 or to 0.5, so when
doing comparisons it's frequently necessary to choose conditions that
aren't optimal.  That's why, in this experiment, the normal
(non-phrase) bogofilter is giving almost 5% false negatives.
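For concreteness, here's a minimal sketch of the pegging step in
Python (nonspam_scores and spam_scores are assumed lists of spamicity
scores, one per message; this is an illustration, not the harness
actually used):

    # Pick the spam cutoff that yields a fixed number of false
    # positives on the nonspam corpus, then measure false negatives
    # on the spam corpus at that cutoff.
    def peg_cutoff(nonspam_scores, max_fp):
        ranked = sorted(nonspam_scores, reverse=True)
        # Classifying as spam at score >= cutoff, the max_fp-th
        # highest nonspam score yields max_fp false positives
        # (ignoring tied scores).
        return ranked[max_fp - 1]

    def false_negative_pct(spam_scores, cutoff):
        fn = sum(1 for s in spam_scores if s < cutoff)
        return 100.0 * fn / len(spam_scores)

With the false positives pegged that way, the fn columns of the
different runs are directly comparable.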

There is a problem with phrases and the current bogofilter: quite a
few classifications simply fail, returning a score of 0.500000 when
there ought to be plenty of good tokens and a score very near 1 or 0.
I think this is a buffer-size issue or something similar, but I don't
really know.  Anyway, in order to get an idea of what might be
accomplished with phrases once that problem is overcome, I've simply
dropped all such failures.  That's why the message counts are lower
in the phrase runs than in the normal ones.
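Dropping them amounts to something like the following (a sketch that
assumes one spamicity score per line of a results file; the real
bookkeeping may differ):

    # Discard classifications that failed outright, i.e. came back
    # as exactly 0.500000, keeping everything else for scoring.
    def usable_scores(lines):
        scores = (float(line) for line in lines)
        return [s for s in scores if abs(s - 0.5) > 1e-9]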

> Database size is a _major_ potential problem.  With PIPE_SIZE 4, the
> spamlist grows by a factor of about 25 and the goodlist by about 10 in
> comparison to the lists built with just single tokens.  Total size,
> with 11,000 spams and 11,000 nonspams, is about 3/4 Gb.

This remains true.
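For anyone unfamiliar with the CRM114 scheme, the blow-up is easy to
see.  As I understand it, each incoming token spawns a family of
phrase tokens built from the preceding window: every ordered
subsequence of the window that ends in the newest token, gaps
allowed.  A rough sketch (my reading of CRM114's SBPH idea; the
experimental bogofilter lexer may differ in detail):

    from itertools import combinations

    # With a window (PIPE_SIZE) of 4, each token is combined with
    # every subset of its 3 predecessors, order preserved: up to
    # 2**3 = 8 derived tokens per input token.
    def phrase_tokens(tokens, pipe_size=4):
        out = []
        for i, tok in enumerate(tokens):
            context = tokens[max(0, i - pipe_size + 1):i]
            for r in range(len(context) + 1):
                for combo in combinations(context, r):
                    out.append(" ".join(combo + (tok,)))
        return out

A 4-token message already yields 15 tokens instead of 4, and since
the longer combinations repeat far less often than single words, the
count of distinct database entries grows faster still, which fits the
observed factors of 10 and 25.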

Here's what I got by pegging the false positives at seven per run:

 lexer run   cutoff fp  fn   sp       pc
normal   0 0.944202  7 264 5667 4.658549
normal   1 0.944202  7 277 5667 4.887948
normal   2 0.944202  7 270 5666 4.765267
phrase   0 0.528125  7  64 5053 1.266574
phrase   1 0.528125  7  65 5049 1.287384
phrase   2 0.528125  7  53 5040 1.051587

The cutoff column gives the spam cutoff used; fp and fn, the numbers
of false positives and false negatives; sp, the number of spams
evaluated; and pc, the percentage of false negatives (100 * fn / sp).
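(Checking the first normal run: 100 * 264 / 5667 = 4.658549, as
tabulated.)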

Using phrases (PIPE_SIZE 4) reduced the false-negative rate by 74.8
percent.
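(That figure compares mean false-negative rates: (4.658549 +
4.887948 + 4.765267)/3 = 4.7706 for normal against (1.266574 +
1.287384 + 1.051587)/3 = 1.2018 for phrases, and 1 - 1.2018/4.7706 =
0.748.)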

By adjusting the false-positive target to 11 or 12 instead of 7, it
is possible to cut the false-negative count of the normal
(non-phrase) bogofilter by about half.  However, that is still about
double what we get with phrases, and that level (0.2%) of false
positives is not deemed tolerable.  With phrases, if we were to
accept a 2% false-negative rate, we could probably cut false
positives to 0.06% or thereabouts.

It's important to note that the message corpus used in this test was a
rather challenging one, with many spammy-looking nonspams and similar
"difficult" cases.  We consider we're doing well, with this corpus, if
we can keep delivered spam under 3% while maintaining a false-positive
rate of 1/1000 or so.  In production, where we have a bigger training
database, that's about the level of discrimination we're getting with
bogofilter 0.12.3.

-- 
| G r e g  L o u i s          | gpg public key: finger     |
|   http://www.bgl.nu/~glouis |   glouis at consultronics.com |
| http://wecanstopspam.org in signatures fights junk email |
