Naive Bayes classifier derived from bogofilter-0.7

Tue Nov 26 15:02:12 CET 2002

On 20021123 (Sat) at 1822:17 -0500, Scott Lenser wrote:

>   I'm attaching the Naive Bayes classifier I've made derived from
> bogofilter-0.7 for people to look at, test, steal code from, whatever.
> This code requires the library libgmime in order to run.  I'm using
> version 0.6.0.  I'm interested in any/all comments related to this
> filter.

Here are the results of an experiment designed to compare bogofilter's
method of calculation (Robinson-Fisher) with naïve Bayes as implemented
in Scott Lenser's bogofilter_srl.  In the tables, chi is used to
designate the Robinson-Fisher results (since a chi-squared combination
of probabilities is involved), and srl the naïve Bayesian data.

The "proc" and "lost" columns document a bug in bogofilter_srl that
causes a few messages to be lost, in the sense that no outcome report
is issued for them: "proc" counts the messages for which a result was
reported, and "lost" those for which none was.  The "fp" and "fn"
columns report false positives and false negatives, respectively.

For this experiment, both programs were trained with 2525 spam and 2499
nonspam messages, and then run against three files of nonspam
containing 2829, 2829 and 2828 messages, and three files of spam
containing 1267 messages each.

A high proportion of false positives was reported by the naïve Bayes
run, probably because there were a lot of newsletters with spamlike
characteristics among the nonspams.  To facilitate comparison, the spam
cutoff value for the Robinson-Fisher method was adjusted so that it
produced about the same number of false positives, as shown in the
following table:

  calc proc lost  fp
1  chi 2829    0 434
2  chi 2829    0 421
3  chi 2828    0 438
4  srl 2769   60 435
5  srl 2777   52 419
6  srl 2774   54 428
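
To check how closely the cutoff adjustment matched the two methods, the
false-positive counts can be turned into rates.  This is only a sketch
in Python: the names rows, fp_percent and overall_fp_percent are
illustrative, and dividing by "proc" rather than by the file size is an
assumption (it leaves the lost messages out of the denominator).

```python
# (calc, proc, lost, fp), copied from the false-positive table above.
rows = [
    ("chi", 2829, 0, 434),
    ("chi", 2829, 0, 421),
    ("chi", 2828, 0, 438),
    ("srl", 2769, 60, 435),
    ("srl", 2777, 52, 419),
    ("srl", 2774, 54, 428),
]


def fp_percent(proc, fp):
    """False positives as a percentage of the messages processed (assumed denominator)."""
    return 100.0 * fp / proc


def overall_fp_percent(calc):
    """False-positive percentage for one method, summed over its three runs."""
    proc = sum(r[1] for r in rows if r[0] == calc)
    fp = sum(r[3] for r in rows if r[0] == calc)
    return fp_percent(proc, fp)


for calc, proc, lost, fp in rows:
    print(f"{calc} proc={proc} lost={lost} fp%={fp_percent(proc, fp):.2f}")
for calc in ("chi", "srl"):
    print(f"{calc} overall fp%={overall_fp_percent(calc):.2f}")
```

Both overall rates come out a little above 15 percent, which is what
makes the false-negative comparison meaningful.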

Under these conditions, the Robinson-Fisher method exhibited slightly
superior discrimination power, as the false-negative counts for the
three spam files show.  (It also ran, on average, nearly two orders of
magnitude faster, though this was distorted by one message that took
bogofilter_srl almost twenty minutes to process.)  In the table,
"percent" is fn as a percentage of proc.

  calc run proc lost  fn percent
1  chi   1 1267    0  31   2.447
2  chi   2 1267    0  90   7.103
3  chi   3 1267    0  91   7.182
4  srl   1 1266    1 160  12.638
5  srl   2 1266    1 153  12.085
6  srl   3 1265    2 145  11.462
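
The percent column can be reproduced by dividing fn by proc (not by the
file size of 1267, which would give slightly lower srl figures).  A
minimal Python sketch; the names rows, fn_percent and percents are
illustrative only:

```python
# (calc, run, proc, lost, fn), copied from the false-negative table above.
rows = [
    ("chi", 1, 1267, 0, 31),
    ("chi", 2, 1267, 0, 90),
    ("chi", 3, 1267, 0, 91),
    ("srl", 1, 1266, 1, 160),
    ("srl", 2, 1266, 1, 153),
    ("srl", 3, 1265, 2, 145),
]


def fn_percent(proc, fn):
    """False negatives as a percentage of the messages actually processed."""
    return 100.0 * fn / proc


# Map (calc, run) to the percentage, rounded to three decimals like the table.
percents = {
    (calc, run): round(fn_percent(proc, fn), 3)
    for calc, run, proc, lost, fn in rows
}

for key, value in percents.items():
    print(key, value)
```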

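Summarizing the false-negative results (meanfnpc is the mean
false-negative percentage over the three runs; lcl95 and ucl95 are the
lower and upper 95% confidence limits):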

  calc meanfnpc lcl95  ucl95
1  chi    5.577 2.433  8.722
2  srl   12.062 8.917 15.207
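
The post does not say how the confidence limits were computed.  Both
intervals have the same half-width, which is what a pooled-variance
interval (one variance estimate shared by both methods, 4 degrees of
freedom) would give, so the sketch below assumes that method.  The
names are illustrative, and the Student t critical value is entered as
a constant because the Python standard library has no t distribution.

```python
from math import sqrt

# False-negative percentages per run, copied from the table above.
fn_percent = {
    "chi": [2.447, 7.103, 7.182],
    "srl": [12.638, 12.085, 11.462],
}

# Two-sided 95% critical value of Student's t with 4 degrees of freedom.
T_CRIT_DF4 = 2.776445


def mean(xs):
    return sum(xs) / len(xs)


def pooled_variance(groups):
    """Within-group sum of squares divided by the total within-group df."""
    ss = sum(sum((x - mean(g)) ** 2 for x in g) for g in groups)
    df = sum(len(g) - 1 for g in groups)
    return ss / df


def confidence_limits(data, t_crit):
    """Return {name: (mean, lower, upper)} using the pooled variance (an assumption)."""
    var = pooled_variance(list(data.values()))
    out = {}
    for name, g in data.items():
        half = t_crit * sqrt(var / len(g))
        out[name] = (mean(g), mean(g) - half, mean(g) + half)
    return out


limits = confidence_limits(fn_percent, T_CRIT_DF4)
for name, (m, lo, hi) in limits.items():
    print(f"{name} mean={m:.3f} lcl95={lo:.3f} ucl95={hi:.3f}")
```

Under that assumption the results agree with the table to within about
0.001, the small differences presumably coming from rounding of the
input percentages.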

The difference is statistically significant at the 0.05 level (the two
95% confidence intervals do not overlap), for what that's worth.
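
Which test was used is not stated here.  One plausible way to check the
claim is a two-sample t test with pooled variance on the three
false-negative percentages per method; the sketch below assumes that
test, uses illustrative names, and hardcodes the critical value for 4
degrees of freedom.

```python
from math import sqrt

# False-negative percentages per run, copied from the table above.
chi = [2.447, 7.103, 7.182]
srl = [12.638, 12.085, 11.462]

# Two-sided 0.05-level critical value of Student's t with 4 degrees of freedom.
T_CRIT_DF4 = 2.776445


def mean(xs):
    return sum(xs) / len(xs)


def pooled_t(a, b):
    """Two-sample t statistic with pooled variance; returns (t, df)."""
    ss = sum((x - mean(a)) ** 2 for x in a) + sum((x - mean(b)) ** 2 for x in b)
    df = len(a) + len(b) - 2
    se = sqrt((ss / df) * (1 / len(a) + 1 / len(b)))
    return (mean(b) - mean(a)) / se, df


t_stat, df = pooled_t(chi, srl)
significant = abs(t_stat) > T_CRIT_DF4
print(f"t={t_stat:.2f} df={df} significant at 0.05: {significant}")
```

With only three runs per method the test has little power, so the
result is best read as a rough check rather than a firm conclusion.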

-- 
| G r e g  L o u i s          | gpg public key:      |
|   http://www.bgl.nu/~glouis |   finger greg at bgl.nu |