Sunday test results

Mon Feb 17 15:56:17 CET 2003

On 20030216 (Sun) at 17:47:36 -0500, David Relson wrote:
> Hi Greg,
> 
> Finally!  Today's test results.  For training I used 2 months of my data 
> (Oct & Dec) and for testing I used the other 2 months (Dec & Jan).  With 
> the 2-2 split and the date being 02/16, I've called the results 
> test.0216.22.tgz.

I'm still highly suspicious about all those 0.5's, but there seems to
be no getting around it: you have a small number (less than three
percent) of spam in your test corpus that are distributed all across
the spamicity-score spectrum.  The composition of your incoming email
stream seems to be just plain _different_ from mine.  Unfortunately,
the numbers are small; but still:

dr <- read.table("/root/scratch/gl.0216.22/parms.tbl",
  col.names=c("tagging", "md", "cutoff", "run", "fp", "fn"))
dr
   tagging    md cutoff run fp fn
1    notag 0.025    0.5   0  4 17
2    notag 0.025    0.5   1  4 17
3    notag 0.050    0.5   0  4 17
4    notag 0.050    0.5   1  4 17
5    notag 0.075    0.5   0  4 16
6    notag 0.075    0.5   1  4 15
7    notag 0.100    0.5   0  4 15
8    notag 0.100    0.5   1  4 14
9    notag 0.125    0.5   0  4 14
10   notag 0.125    0.5   1  4 13
11     tag 0.025    0.5   0  4 13
12     tag 0.025    0.5   1  4 16
13     tag 0.050    0.5   0  4 13
14     tag 0.050    0.5   1  4 16
15     tag 0.075    0.5   0  4 12
16     tag 0.075    0.5   1  4 15
17     tag 0.100    0.5   0  4 11
18     tag 0.100    0.5   1  4 11
19     tag 0.125    0.5   0  4  9
20     tag 0.125    0.5   1  4 14

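# Treat tagging and min_dev as categorical factors
dr$tagging <- factor(dr$tagging)
dr$md <- factor(dr$md)
# Two-way ANOVA on false negatives; tagging*md already expands to both
# main effects plus their interaction, and data=dr avoids attach()
draov <- aov(fn ~ tagging * md, data = dr)
summary(draov)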

            Df Sum Sq Mean Sq F value  Pr(>F)   
tagging      1 31.250  31.250 11.3636 0.00711 **
md           4 39.500   9.875  3.5909 0.04599 * 
tagging:md   4  1.500   0.375  0.1364 0.96509   
Residuals   10 27.500   2.750                   

Tagging makes a difference (p = 0.007); min_dev makes a smaller one
(p = 0.046, barely significant at the 5% level, but still...), and
there's no apparent interaction between the two (p = 0.97).  In your
case the runs aren't a factor, so I treated them as replication.
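
For anyone without access to parms.tbl, the analysis above can be
rerun from the printed table alone.  This is a minimal sketch, not the
original session: the fn values are typed in by hand from the table,
the layout is rebuilt with expand.grid(), and the name tab is only
illustrative.  fp is 4 in every row, so only fn is modelled.

```r
# Rebuild the 2 x 5 x 2 design; expand.grid varies 'run' fastest,
# then 'md', then 'tagging', which matches the row order of the table.
dr <- expand.grid(run = 0:1,
                  md = c(0.025, 0.050, 0.075, 0.100, 0.125),
                  tagging = c("notag", "tag"))

# False-negative counts transcribed from rows 1..20 of the table.
dr$fn <- c(17, 17, 17, 17, 16, 15, 15, 14, 14, 13,
           13, 16, 13, 16, 12, 15, 11, 11,  9, 14)

# Both predictors are treated as categorical factors.
dr$md <- factor(dr$md)
dr$tagging <- factor(dr$tagging)

# Two-way ANOVA with interaction; the two runs per cell act as replicates.
draov <- aov(fn ~ tagging * md, data = dr)
tab <- summary(draov)[[1]]
print(tab)
```

The rows of tab come out in the order tagging, md, tagging:md,
Residuals, so the sums of squares can be checked against the table
above by position, e.g. tab[1, "Sum Sq"] for tagging.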

Increasing min_dev helps your discrimination and hurts mine; put
another way, in your training corpus, the low-deviation tokens are poor
in information characteristic of spam or nonspam, whereas in mine there
must be useful information in that group.  How does that come about?  I
have no idea.  The fact that in my case the tagging makes a positive
difference _only_ with low min_dev suggests the possibility that my
useful low-deviation tokens are to some extent associated with headers,
but that does little to explain the differences from your results.
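
The direction of the two effects in David's table can be read off the
marginal means of fn.  A small sketch under the same assumption as
before, namely that the counts are transcribed by hand from the table
above; the names mean_by_tag and mean_by_md are only illustrative, and
nothing here covers the other corpus, whose numbers aren't in this
message.

```r
# Same reconstruction of the table as a data frame (run varies fastest).
dr <- expand.grid(run = 0:1,
                  md = c(0.025, 0.050, 0.075, 0.100, 0.125),
                  tagging = c("notag", "tag"))
dr$fn <- c(17, 17, 17, 17, 16, 15, 15, 14, 14, 13,
           13, 16, 13, 16, 12, 15, 11, 11,  9, 14)

# Mean false negatives with and without tagging.
mean_by_tag <- tapply(dr$fn, dr$tagging, mean)
print(mean_by_tag)   # notag 15.5, tag 13.0

# Mean false negatives at each min_dev setting.
mean_by_md <- tapply(dr$fn, factor(dr$md), mean)
print(mean_by_md)    # 15.75 15.75 14.50 12.75 12.50
```

Fewer false negatives is better, so in this table tagging helps by
about 2.5 messages on average, and the mean fn falls from 15.75 to
12.5 as min_dev goes from 0.025 to 0.125.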

-- 
| G r e g  L o u i s          | gpg public key:      |
|   http://www.bgl.nu/~glouis |   finger greg at bgl.nu |
| Help free our mailboxes. Include                   |
|        http://wecanstopspam.org in your signature. |